Abstract



This paper presents a general and efficient framework for probabilistic inference and learning from 
arbitrary uncertain information. It exploits the calculation properties of finite mixture models,
conjugate families and factorization. Both the joint probability density of the variables and the likelihood
function of the (objective or subjective) observation are approximated by a special mixture model, in
such a way that any desired conditional distribution can be directly obtained without numerical inte- 
gration. We have developed an extended version of the expectation maximization (EM) algorithm to 
estimate the parameters of mixture models from uncertain training examples (indirect observations). As 
a consequence, any piece of exact or uncertain information about both input and output values is con- 
sistently handled in the inference and learning stages. This ability, extremely useful in certain situa- 
tions, is not found in most alternative methods. The proposed framework is formally justified from 
standard probabilistic principles and illustrative examples are provided in the fields of nonparametric 
pattern classification, nonlinear regression and pattern completion. Finally, experiments on a real
application and comparative results over standard databases provide empirical evidence of the utility of the
method in a wide range of applications. 

1. Introduction 

The estimation of unknown magnitudes from available information, in the form of sensor
measurements or subjective judgments, is a central problem in many fields of science and engineering.
To solve this task, the domain must be accurately described by a model able to support the desired 
range of inferences. When satisfactory models cannot be derived from first principles, approxima- 
tions must be obtained from empirical data in a learning stage. 

Consider a domain Z composed of a collection of objects $z = (z^1, z^2, \ldots, z^n)$, represented by
vectors of n attributes. Given some partial knowledge S (expressed in a general form explained
later) about a certain object z, we are interested in computing a good estimate $\hat{z}(S)$, close to the
true z. We allow heterogeneous descriptions; any attribute $z^i$ may be continuous, discrete, or
symbolic valued, including mixed types. If there is a specific subset of unknown or uncertain attributes
to be estimated, the attribute vector can be partitioned as z = (x, y), where $y \subset z$ denotes the target
or output attributes. The target attributes can be different for different objects z. This scenario
includes several usual inference paradigms. For instance, when there is a specific target symbolic
attribute, the task is called pattern recognition or classification; when the target attribute is
continuous, the inference task is called regression or function approximation. More broadly, we are
interested in a general framework for pattern completion from partially known objects.

Example 1 : To illustrate this setting, assume that the preprocessor of a hypothetical computer 
vision system obtains features of a segmented object. The instances of the domain are described 
by the following n=7 attributes: Area: $z^1 \in \mathbb{R}$, Color: $z^2 \in$ {white, black, red, ...}, Distance:
$z^3 \in \mathbb{R}$, Shape: $z^4 \in$ {circular, rectangular, triangular, ...}, Texture: $z^5 \in$ {soft, rough, ...},
ObjectType: $z^6 \in$ {door, window, ...} and Angle: $z^7 \in \mathbb{R}$. A typical instance may be z = (78,
blue, 3.4, triangular, soft, window, 45). If the object is partially occluded or 3-dimensional,
some attributes will be missing or uncertain. For instance, the available information S about z
could be expressed as (74±8, blue OR black, 3.4, triangular, ?, window 70% door 30%, ?),
where $z^1$, $z^2$, $z^6$ are uncertain, $z^3$, $z^4$ are exact and $z^5$, $z^7$ are missing. In this case we could be
interested in estimates for $y = \{z^5, z^6, z^7\}$ and even in improving our knowledge on $z^1$ and $z^2$.

The non-deterministic nature of many real world domains suggests a probabilistic approach, 
where attributes are considered as random variables. Objects are assumed to be drawn independently
and identically distributed from $p(z) = p(z^1, \ldots, z^n) = p(x, y)$, the multivariate joint probability
density function of the attributes, which completely characterizes the n-dimensional random variable
z. To simplify notation, we will use the same function symbol p(·) to denote different p.d.f.'s if
they can be identified without risk of confusion.

According to Statistical Decision Theory (Berger 1985), optimum estimators for the desired
attributes are obtained through minimization of a suitable expected loss function:

$$\hat{y}_{OPT}(S) = \arg\min_{\hat{y}} E\{L(y, \hat{y}) \mid S\}$$

where $L(y, \hat{y})$ is the loss incurred when the true y is estimated by $\hat{y}$. Estimators are always
features of the conditional or posterior distribution p(y|S) of the target variables given the available
information. For instance, the minimum squared error (MSE) estimator is the posterior mean, the
minimum linear loss estimator is the posterior median and the minimum error probability (EP, 0-1
loss) estimator is the posterior mode.

Example 2 : A typical problem is the prediction of an unknown attribute y from the observed
attributes x. In this case the available information can be written as S = (x, ?). If y is continuous, it
is reasonable to use the MSE estimator: $\hat{y}_{MSE}(x) = E\{y \mid x\}$, the general regression function. If y
is symbolic and the same loss is associated to all errors, the EP estimator is adequate:
$\hat{y}_{EP}(S) = \arg\max_y p(y|x) = \arg\max_y p(x|y)\,p(y)$. It corresponds to the Maximum A Posteriori rule
or Bayes Test, widely used in Statistical Pattern Recognition.

The joint density p(z) = p(x, y) plays an essential role in the inference process. It implicitly
includes complete information about attribute dependences. In principle, any desired conditional
distribution or estimator can be computed from the joint density by adequate integration.
Probabilistic Inference is the process of computing the desired conditional probabilities from a (possibly
implicit) joint distribution. From p(z) (the prior, model of the domain, comprising implications) and
S (a known event, somewhat related to a certain z), we could obtain the posterior p(z|S) and the
desired target marginal p(y|S) (the probabilistic "consequent").

Example 3 : If we observe an exact value $x_o$ in attribute x, i.e. $S = \{x = x_o\}$, we have:

$$p(y|S) = p(y|x_o) = \frac{p(x_o, y)}{\int p(x_o, y)\,dy}$$


If we know that instance z is in a certain region R in the attribute space, i.e. $S = \{z \in R\}$, we
compute the marginal density of y from the joint p(z) = p(x, y) restricted to region R (Fig. 1):

$$p(y|S) = \int p(x, y \mid \{(x,y) \in R\})\,dx = \frac{\int_{\{x : (x,y) \in R\}} p(x,y)\,dx}{\iint_R p(x,y)\,dx\,dy}$$


More general types of uncertain information S about z will be discussed later. 
Figure 1. The conditional probability density of y, assuming $z = (x, y) \in R$.



In summary, from the joint density p(z) of a multivariate random variable, any subset of variables
$y \subset z$ may be, in principle, estimated given the available information S about the whole z = (x, y).
In practical situations, two steps are required to solve the inference problem. First, a good model
of the true joint density p(z) must be obtained. Second, the available information S must be effi- 
ciently processed to improve our knowledge about future, partially specified instances z. These two 
complementary aspects, learning and inference, are approached from many scientific fields, pro- 
viding different methodologies to solve practical applications. 

From the point of view of Computer Science, the essential goal of Inductive Inference is to 
find an approximate intensional definition (properties) of an unknown concept (subset of the do- 
main) from an incomplete extensional definition (finite sample). Machine Learning techniques 
(Michalski, Carbonell & Mitchell 1977, 1983, Hutchinson 1994) provide practical solutions (e.g. 
automatic construction of decision trees) to solve many situations where explicit programming 
must be avoided. Computational Learning Theory (Valiant 1993, Wolpert 1994, Vapnik 1995) 
studies the feasibility of induction in terms of generalization ability and resource requirements of 
different learning paradigms. 

Under the general setting of Statistical Decision Theory, modeling techniques and the operational
aspects of inference (based on numerical integration, Monte Carlo simulation, analytic
approximations, etc.) are extensively studied from the Bayesian perspective (Berger 1985, Bernardo
& Smith 1994). In the more specific field of Statistical Pattern Recognition (Duda & Hart 1973,
Fukunaga 1990), standard parametric or nonparametric density approximation techniques (Izenman
1991) are used to learn from training data the class-conditional p.d.f.'s required by the optimum
decision rule. For instance, if the class-conditional densities p(x|y) are Gaussian, the required
parameters are the mean vector and covariance matrix of the feature vector in each class and the
decision regions for y in x have quadratic boundaries. Among the nonparametric classification
techniques, the Parzen method and the K-nearest neighbors rule must be mentioned. Analogously, if
the target attribute is continuous and the statistical dependence between input and output variables
p(x, y) can be properly modeled by joint normality, we get multivariate linear regression:
$\hat{y}_{MSE}(x) = Ax + B$, where the required parameters are the mean values and the covariance matrix of the
attributes. Nonlinear regression curves can also be derived from nonparametric approximation
techniques. Nonparametric methods present slower convergence rates, requiring significantly larger
sample sizes to obtain satisfactory approximations; they are also strongly affected by the
dimensionality of the data, and the selection of the smoothing parameter is a crucial step. In contrast, they
only require some kind of smoothness assumption on the target density.

Neural Networks (Hertz et al. 1991) are computational models trainable from empirical data
that have been proposed to solve more complex situations. Their intrinsic parallel architecture is
especially efficient in the inference stage. One of the most widely used neural models is the
Multilayer Perceptron, a universal function approximator (Hornik et al. 1989) that breaks the limitations
of linear decision functions. The Backpropagation learning algorithm (Rumelhart et al. 1986) can,
in principle, adjust the network weights to implement arbitrary mappings, and the network outputs
show desirable probabilistic properties (Wan 1990, Rojas 1996). There are also unsupervised
networks for probability density function approximation (Kohonen 1989). However, neural models
usually contain a large number of adjustable parameters, which is not convenient for generalization
and, frequently, long times are required for training in relatively easy tasks. The input / output role
of attributes cannot be changed at runtime, and missing and uncertain values are poorly supported.

Bayesian Networks, based on the concept of conditional independence, are among the most
relevant probabilistic inference technologies (Pearl 1988, Heckerman & Wellman 1995). The joint 
density of the variables is modeled by a directed graph which explicitly represents dependence 
statements. A wide range of inferences can be performed under this framework (Chang & Fung 
1995, Lauritzen & Spiegelhalter 1988) and there are significant results on inductive learning of 
network structures (Bouckaert 1994, Cooper & Herskovits 1992, Valiveti & Oomen 1992). This 
approach is adequate when there is a large number of variables showing explicit dependences and 
simple cause-effect relations. Nevertheless, solving arbitrary queries is NP-Complete, automatic 
learning algorithms are time consuming and the allowed dependences between variables are rela- 
tively simple. 

In an attempt to mitigate some of the above drawbacks, we have developed a general and
efficient inference and learning framework based on the following considerations. It is well known
(Titterington et al. 1985, McLachlan & Basford 1988, Dalal & Hall 1983, Bernardo & Smith 1994,
Xu & Jordan 1996) that any reasonable probability density function p(z) can be approximated up to
the desired degree of accuracy by a finite mixture of simple components $C_i$, i = 1..l:

$$p(z) = \sum_i P\{C_i\}\, p(z|C_i) \qquad (1)$$

The superposition of simple densities is extensively used to approximate arbitrary data depen- 
dences (Fig. 2). Maximum Likelihood estimators of the mixture parameters can be efficiently ob- 
tained from samples by the Expectation Maximization (EM) algorithm (Dempster, Laird & Rubin 
1977, Redner & Walker 1984) (see Section 4). 
Figure 2. Illustrative example of density approximation using a mixture model. (a) Samples
from a p.d.f. p(x, y) showing a nonlinear dependence. (b) Mixture model for p(x, y)
with 6 Gaussian components obtained by the standard EM algorithm. (c) Location of
components.



The decomposition of probability distributions using mixtures has been frequently applied to
unsupervised learning tasks, especially Cluster Analysis (McLachlan & Basford 1988, Duda &
Hart 1973, Fukunaga 1990): the a posteriori probabilities of each postulated category are computed
for all the examples, which are labeled according to the most probable source density. However,
mixture models are especially useful in nonparametric supervised learning situations. For instance,
the class-conditional densities required in Statistical Pattern Recognition were individually
approximated in (Priebe & Marchette 1991, Traven 1991) by finite mixtures; hierarchical mixtures of
linear models were proposed in (Jordan & Jacobs 1994, Peng et al. 1995); mixtures of factor
analyzers have been developed in (Ghahramani & Hinton 1996, Hinton, Dayan, & Revow 1997) and
mixture models have also been useful for feature selection (Pudil et al. 1995). Mixture modeling is
a growing semiparametric probabilistic learning methodology with applications in many research
areas (Weiss & Adelson 1995, Fan et al. 1996, Moghaddam & Pentland 1997).

This paper introduces a framework for probabilistic inference and learning from arbitrary
uncertain data: any piece of exact or uncertain information about both input and output values is
consistently handled in the inference and learning stages. We approximate both the joint density p(z)
(model of the domain) and the relative likelihood function p(S|z) (describing the available
information) by a specific mixture model with factorized conjugate components, in such a way that
numerical integration is avoided in the computation of any desired estimator, marginal or conditional
density.

The advantages of modeling arbitrary densities using mixtures of natural conjugate
components were already shown in (Dalal & Hall 1983), and, recently, inference procedures based on a
similar idea have been proposed in (Ghahramani & Jordan 1994, Cohn et al. 1996, Peng et al. 1995,
Palm 1994). However, our method efficiently handles uncertain data using explicit likelihood
functions, which have not been extensively used before in Machine Learning, Pattern Recognition or
related areas. We will follow standard probabilistic principles, providing natural statistical
validation procedures.

The organization of the paper is as follows. Section 2 reviews some elementary results and 
concepts used in the proposed framework. Section 3 addresses the inference stage. Section 4 is 
concerned with learning, extending the EM algorithm to manage uncertain information. Section 5 
discusses the method in relation to alternative techniques and presents experimental evaluation. 
The last section summarizes the conclusions and future directions of this work. 






Ruiz, López-de-Teruel & Garrido



2. Preliminaries 



2.1 A Calculus of Generalized Normals 

In many applications, the instances of the domain are represented simultaneously by continuous
and symbolic or discrete variables (as in Wilson & Martinez 1997). To simplify notation, we will
denote both probability impulses and Gaussian densities by means of a common formalism. The
generalized normal $\mathcal{N}(x, \mu, \sigma)$ denotes a probability density function with the following
properties:



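If $\sigma > 0$, $\mathcal{N}(x, \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$

If $\sigma = 0$, $\mathcal{N}(x, \mu, 0) \equiv \mathcal{N}(x, \mu) \equiv \delta(x - \mu)$

$\mathcal{N}(x, \mu, \sigma)$ is a Gaussian density with mean $\mu$ and standard deviation $\sigma > 0$. When the dispersion is
zero, $\mathcal{N}$ reduces to a Dirac's delta function located at $\mu$. In both cases $\mathcal{N}$ is a proper p.d.f.:

$$\int \mathcal{N}(x, \mu, \sigma)\,dx = 1, \qquad \mathcal{N}(x, \mu, \sigma) \ge 0$$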

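The product of generalized normals can be elegantly expressed (Papoulis 1991 pp. 258, Berger
1985) by:

for $\sigma_1 + \sigma_2 > 0$: $$\mathcal{N}(x, \mu_1, \sigma_1) \cdot \mathcal{N}(x, \mu_2, \sigma_2) = \mathcal{N}(x, \eta, \varepsilon) \cdot \mathcal{N}\left(\mu_1, \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right) \qquad (2)$$

where the mean $\eta$ and dispersion $\varepsilon$ of the new normal are given by:

$$\eta = \frac{\sigma_2^2 \mu_1 + \sigma_1^2 \mu_2}{\sigma_1^2 + \sigma_2^2}, \qquad \varepsilon^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

This relation is useful for computing the integral of the product of two generalized normals:

for $\sigma_1 + \sigma_2 > 0$: $$\int \mathcal{N}(x, \mu_1, \sigma_1) \cdot \mathcal{N}(x, \mu_2, \sigma_2)\,dx = \mathcal{N}\left(\mu_1, \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}\right) \qquad (3)$$

And, for consistency, we define

for $\sigma_1 = \sigma_2 = 0$: $$\int \mathcal{N}(x, \mu_1)\, \mathcal{N}(x, \mu_2)\,dx \equiv \mathcal{N}(\mu_1, \mu_2) \equiv I\{\mu_1 = \mu_2\}$$

where I{predicate} = 1 if predicate is true and zero otherwise. Virtually any reasonable univariate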
probability distribution or likelihood function can be accurately modeled by an appropriate mixture 
of generalized normals. In particular, p.d.f.'s over symbolic variables are mixtures of impulses. 
Without loss of generality, symbols may be arbitrarily mapped to specific numbers and represented 
over numeric axes. Integrals over discrete domains become sums. 
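
The product rule (2) can be checked numerically with a minimal sketch. The function names `gnormal` and `product_params` are illustrative assumptions, not part of the paper, and the zero-dispersion case is represented here by an indicator (a probability mass), which is only appropriate for discrete or symbolic values mapped to numbers.

```python
import math

def gnormal(x, mu, sigma):
    """Generalized normal N(x, mu, sigma).

    sigma > 0: Gaussian density. sigma == 0: impulse at mu, represented
    here (an assumption of this sketch) by the indicator I{x == mu}.
    """
    if sigma == 0:
        return 1.0 if x == mu else 0.0
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * sigma)

def product_params(mu1, s1, mu2, s2):
    """Return (eta, eps, k) with N(x,mu1,s1) * N(x,mu2,s2) = k * N(x,eta,eps).

    Implements eq. (2); k is also the integral of the product, eq. (3).
    Requires s1 + s2 > 0.
    """
    if s1 + s2 <= 0:
        raise ValueError("eq. (2) requires s1 + s2 > 0")
    v1, v2 = s1 * s1, s2 * s2
    eta = (v2 * mu1 + v1 * mu2) / (v1 + v2)
    eps = s1 * s2 / math.sqrt(v1 + v2)
    k = gnormal(mu1, mu2, math.sqrt(v1 + v2))
    return eta, eps, k

# Two unit-dispersion normals centered at 0 and 2.
eta, eps, k = product_params(0.0, 1.0, 2.0, 1.0)
x = 0.5
lhs = gnormal(x, 0.0, 1.0) * gnormal(x, 2.0, 1.0)
rhs = k * gnormal(x, eta, eps)
```

When one factor is an impulse (for example `s1 == 0`), the same formulas give `eta == mu1` and `eps == 0`, so the product collapses to an impulse weighted by the other normal evaluated at `mu1`.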

Example 4 : Let us approximate the p.d.f. p(x) of a mixed continuous and symbolic valued
random variable x by a mixture of generalized normals. Assume that x takes with probability 0.4
the exact value 10 (with a special meaning), and with probability 0.6 a random value
continuously distributed following the triangular shape shown in Fig. 3. The density p(x) can be
accurately approximated (see Section 4) using 4 generalized normals:

pix) = AO!hrix,lO) + .2lJ\rix,M,.2'i) + .28J\rix,A5,.28) + .li:A/'(;c,.99,.21) 
2.2 Modeling Uncertainty: The Likelihood Principle

Assume that the value of a random variable z must be inferred from a certain observation or
subjective information S. If z has been drawn from p(z) and the measurement or judgment process is
characterized by a conditional p(S|z), our knowledge on z is updated according to
$p(z|S) = p(z)\,p(S|z) / p(S)$, where $p(S) = \int_z p(S|z)\,p(z)\,dz$ (see Fig. 4).

The likelihood function $f_S(z) = p(S|z)$ is the probability density ascribed to S by each possible
z. It is an arbitrary nonnegative function over z that can be interpreted in two alternative ways. It
can be the "objective" conditional distribution p(S|z) of a physical measurement process (e.g. a
model of sensor noise, specifying bias and variance of the observable S for every possible true
value z), also known as error model. It can also be a "subjective" judgment about the chance of the
different z values (e.g. intervals, more likely regions, etc.), based on vague or difficult to formalize
information. The dispersion of $f_S(z)$ is directly related to the uncertainty associated to the
measurement process. Following the likelihood principle (Berger 1985), we explicitly assume that all the
experimental information required to perform probabilistic inference is contained in the likelihood
function $f_S(z)$.




Figure 4. Illustration of the elementary Bayesian univariate inference process. Labeled elements:
prior, model of measurement p(s|z), observable p(s), likelihood of observation $s_o$, posterior $p(z|s_o)$.
2.3 Inference Using Mixtures of Conjugate Densities

The computation of p(z\S) may be hard, unless p(z) and p(S\z) belong to special (conjugate) fami- 
lies (Berger 1985, Bernardo & Smith 1994). In this case the posterior density can be analytically 
obtained from the parameters of the prior and the likelihood, avoiding numeric integration. The 
prior, the likelihood and the posterior are in the same mathematical family. The "belief structure" is 
closed under the inference process. 

Example 5 : In the univariate case, assume that z is known to be normally distributed around r
with dispersion $\sigma_r$, i.e. $p(z) = \mathcal{N}(z, r, \sigma_r)$. Assume also that our measurement device has
Gaussian noise, so the observed values are distributed according to $p(s|z) = \mathcal{N}(s, z, \sigma_s)$. Therefore,
if we observe a certain value $s_o$, from the property of the product of generalized normals in eq.
(2), the posterior knowledge on z becomes another normal $\mathcal{N}(z, \mu, \varepsilon)$. The new expected
location of z can be expressed as a weighted average of r and $s_o$: $\mu = \gamma s_o + (1 - \gamma)\, r$, and the
uncertainty is reduced to $\varepsilon^2 = \gamma \sigma_s^2$. The coefficient $\gamma = \sigma_r^2 / (\sigma_r^2 + \sigma_s^2)$ quantifies the relative
importance of the experiment with respect to the prior.
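
The update in Example 5 can be sketched in a few lines. The function name `normal_update` and the numeric values are illustrative assumptions; the formulas are the weighted-average form that follows from eq. (2) with both dispersions positive.

```python
def normal_update(r, sigma_r, s_o, sigma_s):
    """Posterior N(z, mu, eps) for prior N(z, r, sigma_r) and a Gaussian
    measurement s_o with noise dispersion sigma_s (both dispersions > 0)."""
    gamma = sigma_r**2 / (sigma_r**2 + sigma_s**2)  # weight of the experiment
    mu = gamma * s_o + (1.0 - gamma) * r            # weighted average of s_o and r
    eps = (gamma ** 0.5) * sigma_s                  # eps^2 = gamma * sigma_s^2
    return mu, eps, gamma

# Hypothetical numbers: vague prior around 10, sharper measurement at 12.
mu, eps, gamma = normal_update(r=10.0, sigma_r=2.0, s_o=12.0, sigma_s=1.0)
```

With these numbers the measurement is four times more precise (in variance) than the prior, so the posterior location moves most of the way towards the observed value and the posterior dispersion is smaller than both `sigma_r` and `sigma_s`.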

This computational advantage can be extended to the general case by using mixtures of
conjugate families (Dalal & Hall 1983) to approximate the desired joint probability distribution and the
likelihood function.

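Example 6 : If the domain and the likelihood are modeled respectively by

$$p(z) = \sum_i P_i\, \mathcal{N}(z, \mu_i, \sigma_i) \qquad \text{and} \qquad p(s_o|z) = \sum_r \pi_r\, \mathcal{N}(z, \eta_r, \varepsilon_r)$$

(where $\pi_r$, $\eta_r$ and $\varepsilon_r$ depend explicitly on the observed $s_o$), then the posterior can also be
written as the following mixture:

$$p(z|s_o) = \sum_{i,r} \theta_{i,r}\, \mathcal{N}(z, \nu_{i,r}, \lambda_{i,r}) \qquad (4)$$

From properties (2) and (3), the parameters $\nu_{i,r}$ and $\lambda_{i,r}$ and the weights $\theta_{i,r}$ are given by:

$$\nu_{i,r} = \frac{\sigma_i^2 \eta_r + \varepsilon_r^2 \mu_i}{\sigma_i^2 + \varepsilon_r^2}, \qquad \lambda_{i,r}^2 = \frac{\sigma_i^2 \varepsilon_r^2}{\sigma_i^2 + \varepsilon_r^2}$$

$$\theta_{i,r} = \frac{P_i\, \pi_r\, \mathcal{N}\left(\mu_i, \eta_r, \sqrt{\sigma_i^2 + \varepsilon_r^2}\right)}{\sum_{k,l} P_k\, \pi_l\, \mathcal{N}\left(\mu_k, \eta_l, \sqrt{\sigma_k^2 + \varepsilon_l^2}\right)}$$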
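
A minimal sketch of the mixture posterior in eq. (4) follows, restricted to the continuous case (all dispersions strictly positive). The names `gauss` and `mixture_posterior`, the tuple layout and the numbers are assumptions made for illustration only.

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma > 0."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_posterior(prior, likelihood):
    """Combine a prior mixture with a likelihood mixture, as in eq. (4).

    prior:      list of (P_i, mu_i, sigma_i)
    likelihood: list of (pi_r, eta_r, eps_r)
    Returns one (theta, nu, lam) triple per pair (i, r); thetas sum to 1.
    """
    comps = []
    for P, mu, sigma in prior:
        for pi_r, eta, eps in likelihood:
            v = sigma**2 + eps**2
            weight = P * pi_r * gauss(mu, eta, math.sqrt(v))  # property (3)
            nu = (sigma**2 * eta + eps**2 * mu) / v           # property (2)
            lam = sigma * eps / math.sqrt(v)
            comps.append((weight, nu, lam))
    total = sum(w for w, _, _ in comps)
    return [(w / total, nu, lam) for w, nu, lam in comps]

# Hypothetical bimodal prior and a single noisy observation near 3.
prior = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
likelihood = [(1.0, 3.0, 1.0)]
posterior = mixture_posterior(prior, likelihood)
```

The posterior has one component per (prior component, likelihood component) pair, so its size is the product of the two mixture sizes; no numerical integration is involved at any point.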

2.4 The Role of Factorization 

Given a multivariate observation z partitioned into two subvectors, z = (x, y), assume that we are
interested in inferring the value of the unknown attributes y from the observed attributes x. Note
that if x and y are statistically independent, the joint density is factorizable: p(z) = p(x, y) = p(x)
p(y) and, therefore, the posterior p(y|x) equals the prior marginal p(y). The observed x carries no
predictive information about y and the optimum estimators do not depend on x. For instance,
$\hat{y}_{MSE}(x) = E\{y|x\} = E\{y\}$ and $\hat{y}_{EP}(x) = \arg\max_y p(y)$. This is the simplest estimation task. No
runtime computations are required for the optimum solution, which may be precalculated.

In realistic situations the variables are statistically dependent. In general, the joint density
cannot be factorized and the required marginal densities may be hard to compute. However, interesting
consequences arise if the joint density is expressed as a finite mixture of factorized (with
independent variables) components $C_1, C_2, \ldots, C_l$:

$$p(z) = p(z^1, \ldots, z^n) = \sum_i P\{C_i\}\, p(z|C_i) = \sum_i P\{C_i\} \prod_j p(z^j|C_i) \qquad (5)$$

This structure is convenient for inference purposes. In particular, in terms of the desired
partition of z = (x, y):

$$p(z) = p(x, y) = \sum_i P\{C_i\}\, p(x|C_i)\, p(y|C_i)$$

so the marginal densities are mixtures of the marginal components:

$$p(x) = \int p(x, y)\,dy = \sum_i P\{C_i\}\, p(x|C_i)$$

$$p(y) = \sum_i P\{C_i\}\, p(y|C_i)$$

and the desired conditional densities are also mixtures of the marginal components:

$$p(y|x) = \sum_i \alpha_i(x)\, p(y|C_i) \qquad (6)$$

where the weights $\alpha_i(x)$ are the probabilities that the observed x has been generated by each
component $C_i$:

$$\alpha_i(x) = P\{C_i|x\} = \frac{P\{C_i\}\, p(x|C_i)}{\sum_j P\{C_j\}\, p(x|C_j)}$$

The p.d.f. approximation capabilities of mixture models with factorized components remain 
unchanged, at the cost of a possibly higher number of components to obtain the desired degree of 
accuracy, avoiding "artifacts" (see Fig. 5). Section 5.2 discusses the implications of factorization in 
relation with alternative model structures. 
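
The conditional mixture in eq. (6) can be illustrated with a short sketch for one observed attribute x and one target attribute y. The two-component model `MODEL`, the helper names and all numeric values are hypothetical; only the weighting scheme comes from the text.

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma > 0."""
    u = (x - mu) / sigma
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical factorized mixture over z = (x, y):
# each entry is (P{C_i}, (mu_x, sigma_x), (mu_y, sigma_y)).
MODEL = [
    (0.5, (0.0, 1.0), (0.0, 1.0)),
    (0.5, (4.0, 1.0), (2.0, 0.5)),
]

def responsibilities(model, x):
    """alpha_i(x) = P{C_i} p(x|C_i) / sum_j P{C_j} p(x|C_j)."""
    joint = [P * gauss(x, mx, sx) for P, (mx, sx), _ in model]
    total = sum(joint)
    return [w / total for w in joint]

def conditional_mean(model, x):
    """MSE estimator E{y|x}: the mean of the mixture p(y|x) in eq. (6)."""
    alphas = responsibilities(model, x)
    return sum(a * my for a, (_, _, (my, _)) in zip(alphas, model))

y_mid = conditional_mean(MODEL, 2.0)   # x halfway between both components
y_far = conditional_mean(MODEL, 4.0)   # x at the center of the second one
```

Although each component has independent x and y, the estimate of y changes with x because the observation reweights the components; this is how a dependence is represented by a superposition of factorized densities.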




Figure 5. (a) Density approximation for the data in Fig. 2, using a mixture with 8 factorized
components, (b) Location of components. Note how an arbitrary dependence can be
represented as a mixture of components which themselves have independent variables (observe
that a somewhat "smoother" solution could be obtained by increasing the number of
components).
3. The MFGN Framework

The previous concepts will be integrated in a general probabilistic inference framework that we call 
MFGN (Mixtures of Factorized Generalized Normals). Fig. 6 shows the abstract dependence rela- 
tions among attributes in a generic domain (upper section of the figure) and between the attributes 
and the observed information (lower section). In the MFGN framework, both relations are modeled 
by finite mixtures of products of generalized normals. The key idea is using factorization to cope 
with multivariate domains and heterogeneous attribute vectors, and conjugate densities to effi- 
ciently perform inferences given arbitrary uncertain information. In this section, we will derive the 
main inference expressions. The learning stage will be described in Section 4. 




Figure 6. Generic dependences in the inference process.

3.1 Modeling Attribute Dependences in the Domain

In the MFGN framework the attribute dependencies in the domain are modeled by a joint density in
the form of a finite mixture of factorized components, as in expression (5), where the component
marginals $p(z^j|C_i) \equiv \mathcal{N}(z^j, \mu_i^j, \sigma_i^j)$ are generalized normals:

$$p(z) = \sum_i P_i \prod_j \mathcal{N}(z^j, \mu_i^j, \sigma_i^j), \qquad i = 1..l, \quad j = 1..n \qquad (7)$$

If desired, the terms associated to the pure symbolic attributes $z^j$ (with all the $\sigma_i^j = 0$) can be
collected in such a way that the component marginals are expressed as mixtures of impulses:

$$p(z^j|C_i) \equiv \sum_\omega t_{i,\omega}^j\, \mathcal{N}(z^j, \omega) \qquad (8)$$

where $t_{i,\omega}^j \equiv P\{z^j = \omega \mid C_i\}$ is the probability that $z^j$ takes its $\omega$-th value in component $C_i$. This
manipulation reduces the number l of global components in the mixture. The adjustable parameters
of the model are the proportions $P_i = P\{C_i\}$ and the mean value $\mu_i^j$ and dispersion $\sigma_i^j$ of the j-th
attribute in the i-th component (or, for the symbolic attributes, the probabilities $t_{i,\omega}^j$). While the
structure (8) will be explicitly used for symbolic attributes in applications and illustrative
examples, most of the mathematical derivations will be made over the concise expression (7).

When all variables are continuous, the MFGN architecture reduces to a mixture of Gaussians
with diagonal covariance matrices. The proposed factorized structure extends the properties of
diagonal covariance matrices to heterogeneous attribute vectors. We are interested in joint models,
which support inferences from partial information about any subset of variables. Note that there is
not an easy way to define a measure of statistical dependence between symbolic and continuous
attributes, to be used as a parameter of some probability density function¹. The required
"heterogeneous" dependence model can be conveniently captured by superposition of simple factorized
(with independent variables) densities.

Example 7 : Figure 7 shows an illustrative 3-attribute data set (x and y are continuous and z is 
symbolic) and the components of the MFGN approximation obtained by the EM algorithm 
(see Section 4) for their joint density. The parameters of the mixture are shown in Table 1. 
Note that, because of the overlapped structure of the data, some components (5 and 6) are
"assigned" to both values of the symbolic attribute z.




Figure 7. (a) Simple data set with two continuous and one symbolic attribute,
(b) Location of the mixture components.
Table 1. Parameters of the Mixture Model for the Data Set in Fig. 7.



¹ For this reason, in pattern classification tasks separate models are typically built for each class-conditional density.

3.2 Modeling Arbitrary Information about Instances

The available information about a particular instance z is denoted by S. Following the likelihood
principle, we are not concerned with the true nature of S, whether it is some kind of physical
measurement or a more subjective judgment about the location of z in the attribute space. All we need to
update our knowledge about z, in the form of the posterior p(z|S), is the relative likelihood function
p(S|z) of the observed S. In general, p(S|z) can be any nonnegative multivariable function $f_S(z)$ over
the domain. In the objective case, statistical studies of the measurement process can be used to
determine the likelihood function. In the subjective case, it may be obtained by standard
distribution elicitation techniques (Berger 1985). In either case, under the MFGN framework, the
likelihood function of the available information to be used in the inference process will be
approximated, up to the desired degree of accuracy, by a sum of products of generalized normals:

$$p(S|z) = \sum_r P\{S|s_r\}\, p(s_r|z) = \sum_r P\{S|s_r\} \prod_j p(s_r^j|z^j) = \sum_r \pi_r \prod_j \mathcal{N}(z^j, s_r^j, \varepsilon_r^j) \qquad (9)$$

Without loss of generality, the available knowledge is structured as a weighted disjunction $S =
\{\pi_1 s_1 \text{ OR } \pi_2 s_2 \ldots \text{ OR } \pi_R s_R\}$ of conjunctions $s_r = \{s_r^1 \text{ AND } \ldots \text{ AND } s_r^n\}$ of elementary uncertain
observations in the form of generalized normal likelihoods $\mathcal{N}(z^j, s_r^j, \varepsilon_r^j)$ centered at $s_r^j$ with
uncertainty $\varepsilon_r^j$. The measurement process can be interpreted as the result of R (objective or
subjective) sensors $s_r$, providing conditionally independent information $p(s_r^j|z^j)$ about the attributes
(each $s_r^j$ only depends on $z^j$) with relative strength $\pi_r$. Note that any complex uncertain
information about an instance z, expressed as a nested combination of elementary uncertain beliefs $s_r^j$
about $z^j$ using "probabilistic connectives", can be ultimately expressed by structure (9) (OR
translates to addition, AND translates to product, and the product of two generalized normals over the
same attribute becomes a single, weighted normal).

Example 8 : Consider the hypothetical computer vision domain in Example 1. Assume that the
information about an object z is the following: "Area is around a and Distance is around b or,
more likely, Shape is surely triangular or else circular and Area is around c and Angle is
around d or equal to e". This structured piece of information can be formalized as:

$$p(S|z) = .3\, [\, \mathcal{N}(z^1, a, \varepsilon_a)\, \mathcal{N}(z^3, b, \varepsilon_b)\, ]$$
$$\quad +\ .7\, [\, (.9\, \mathcal{N}(z^4, \text{triang}) + .1\, \mathcal{N}(z^4, \text{circ}))\ \mathcal{N}(z^1, c, \varepsilon_c)\ (\mathcal{N}(z^7, d, \varepsilon_d) + \mathcal{N}(z^7, e))\, ]$$

which, expanded, becomes the mixture of 5 factorized components operationally represented by
the parameters shown in Table 2.

In a simpler situation, the available information about z could be a conjunction of uncertain
attributes similar to {Color = red 0.8 green 0.2} and {Area = 3 ± .5} and {Shape = rectangular 0.6
circular 0.3 triangular 0.1}. The likelihood of Shape values can be obtained from the output of a
simple pattern classifier (e.g. K-nearest neighbors) over moment invariants, while attributes
such as Color and Area are directly extracted from the image. In this case we could be interested in
the distribution of values for other attributes such as Texture and ObjectType. Alternatively, we
could start from {ObjectType = door 0.6 window 0.4} and {Texture = rough} in order to
determine the probabilities of Color and Angle values for selecting a promising search region.
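
The expansion of a nested piece of information into the sum-of-products structure (9) can be sketched as follows, using the statement of Example 8. The helper names (`belief`, `disjunction`, `conjunction`, `scale`) and the numeric stand-ins for a, b, c, d, e and their uncertainties are hypothetical; the sketch also assumes that the factors of a conjunction refer to different attributes, so the merge of two normals over the same attribute is not implemented.

```python
from itertools import product

def belief(attr, center, disp=0.0, weight=1.0):
    """Elementary belief N(z^attr, center, disp) as a one-component mixture."""
    return [(weight, {attr: (center, disp)})]

def disjunction(*terms):
    """OR translates to addition: concatenate the components."""
    return [comp for term in terms for comp in term]

def conjunction(*terms):
    """AND translates to product: combine every choice of components."""
    out = []
    for combo in product(*terms):
        weight, attrs = 1.0, {}
        for w, a in combo:
            if attrs.keys() & a.keys():
                raise ValueError("same attribute in two factors; merge them first")
            weight *= w
            attrs.update(a)
        out.append((weight, attrs))
    return out

def scale(k, term):
    """Multiply all component weights by the relative strength k."""
    return [(k * w, a) for w, a in term]

# Example 8 with hypothetical numbers in place of a, b, c, d, e.
info = disjunction(
    scale(0.3, conjunction(belief("Area", 50.0, 5.0),
                           belief("Distance", 3.0, 0.5))),
    scale(0.7, conjunction(
        disjunction(belief("Shape", "triangular", weight=0.9),
                    belief("Shape", "circular", weight=0.1)),
        belief("Area", 80.0, 5.0),
        disjunction(belief("Angle", 45.0, 10.0), belief("Angle", 90.0)),
    )),
)
weights = sorted(w for w, _ in info)
```

Each resulting component lists only the attributes it constrains; attributes that are absent carry no information in that component. The weights are relative strengths of a likelihood function and need not sum to one.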
Table 2. Parameters of the Uncertain Information Model in Example 8.



3.3 The Joint Model-Observation Density 

The generic dependence structure in Fig. 6 is implemented by the MFGN framework as shown in
Fig. 8. The upper section of the figure is the model of nature, obtained in a previous learning stage
and used for inference without further changes. Dependences among attributes are conducted
through an intermediary hidden or latent component $C_i$. The lower section represents the available
uncertain information, measurement model or query structure associated to each particular
inference operation.




Figure 8. Structure of the MFGN model. The attributes are conditionally independent. 
The measurement process is modeled by a collection of independent "virtual" sensors 



The joint density of the relevant variables becomes: 

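$$p(C_i, z, s_r, S) = P\{S|s_r\}\, p(s_r|z)\, p(z|C_i)\, P\{C_i\}$$

$$= P\{C_i\}\, P\{S|s_r\} \prod_j p(s_r^j|z^j)\, p(z^j|C_i)$$

$$= P_i\, \pi_r \prod_j \mathcal{N}(z^j, s_r^j, \varepsilon_r^j)\, \mathcal{N}(z^j, \mu_i^j, \sigma_i^j) \qquad (10)$$

Now we will derive an alternative expression for eq. (10) which is more convenient for
computing the marginal densities of any desired variable. Using the following relation:

$$p(s_r^j|z^j)\, p(z^j|C_i) = p(z^j, s_r^j|C_i) = p(z^j|s_r^j, C_i)\, p(s_r^j|C_i)$$

and properties (2) and (3), we define the "dual" densities of the model:

$$\beta_{i,r}^j \equiv p(s_r^j|C_i) = \int p(s_r^j|z^j)\, p(z^j|C_i)\, dz^j = \mathcal{N}(s_r^j, \mu_i^j, \rho_{i,r}^j) \qquad (11)$$

$$\psi_{i,r}^j(z^j) \equiv p(z^j|s_r^j, C_i) = \frac{p(s_r^j|z^j)}{\beta_{i,r}^j}\, p(z^j|C_i) = \mathcal{N}(z^j, \nu_{i,r}^j, \lambda_{i,r}^j) \qquad (12)$$

where the parameters $\rho_{i,r}^j$, $\nu_{i,r}^j$ and $\lambda_{i,r}^j$ are given by:

$$(\rho_{i,r}^j)^2 = (\sigma_i^j)^2 + (\varepsilon_r^j)^2, \qquad \nu_{i,r}^j = \frac{(\sigma_i^j)^2 s_r^j + (\varepsilon_r^j)^2 \mu_i^j}{(\sigma_i^j)^2 + (\varepsilon_r^j)^2}, \qquad \lambda_{i,r}^j = \frac{\sigma_i^j\, \varepsilon_r^j}{\rho_{i,r}^j}$$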

$\beta_{i,r}^j$ is the likelihood of the $r$-th elementary observation $s_r^j$ of the $j$-th attribute $z^j$ in each
component $C_i$, and $\psi_{i,r}^j(z^j)$ is the effect of the $r$-th elementary information about the $j$-th
attribute $z^j$ over the marginal component $p(z^j|C_i)$ in each component $C_i$. Using the above
notation, the MFGN model structure can be conveniently written as:

$$p(C_i, z, s_r, S) = P_i\, \pi_r \prod_j \beta_{i,r}^j\, \psi_{i,r}^j(z^j) \qquad (13)$$
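The dual densities (11) and (12) reduce to closed-form parameter updates for each attribute. The following Python sketch illustrates them for one continuous attribute and one component. It assumes $\sigma_i^j > 0$ and $\varepsilon_r^j \ge 0$; the helper names `normal_pdf` and `dual_parameters` are hypothetical and do not come from the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    """Ordinary normal density; requires sigma > 0."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def dual_parameters(s, eps, mu, sigma):
    """Return (beta, nu, lam) of eqs. (11)-(12) for one attribute.

    s, eps    : centre and spread of the elementary observation N(z, s, eps)
    mu, sigma : mean and deviation of the component marginal N(z, mu, sigma)
    Assumes sigma > 0 and eps >= 0 (eps == 0 is an exact observation).
    """
    rho = math.hypot(sigma, eps)                  # rho^2 = sigma^2 + eps^2
    beta = normal_pdf(s, mu, rho)                 # likelihood of the observation
    nu = (sigma**2 * s + eps**2 * mu) / rho**2    # mean of the modified marginal
    lam = sigma * eps / rho                       # deviation of the modified marginal
    return beta, nu, lam
```

With `eps = 0` the function returns the ordinary component likelihood of `s` and a modified marginal collapsed onto `s`, which is the exact-information case.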

3.4 The Posterior Density 

In the inference process the available information is combined with the model of the domain to
update our knowledge about a particular object. Given a new piece of information $S$ we must
compute the posterior distribution $p(y|S)$ of the desired target attributes $y \subset z$. Then, estimators
$\hat{y}(S) \approx y$ can be obtained from $p(y|S)$ to minimize any suitable average loss function. This is
efficiently supported under the MFGN framework regardless of the complexity of the domain $p(z)$
and the structure of the available information $S = \{s_r\}$.

The attributes are partitioned into two subvectors $z = (x, y)$, where $y = \{z^d\}$ are the desired
target attributes and $x = \{z^o\}$ are the rest of the attributes. Accordingly, each component $s_r$ of the
available information $S$ is partitioned as $s_r = (s_r^x, s_r^y)$. The information about the target attributes
$y$ in the $r$-th observation, independent from the model $p(z)$, is denoted by $s_r^y$ (often $s_r^y$ is just missing
and there are no such pieces of information) and $s_r^x$ represents the information about the rest of
the attributes $x$. Using this convention we can write:

$$p(z,S) = p(x,y,S) = \sum_{i,r} P_i\, \pi_r\, \beta_{i,r}\, \psi_{i,r}^x(x)\, \psi_{i,r}^y(y)$$

where $\beta_{i,r}$ is the likelihood of the $r$-th conjunction $s_r$ of $S$ in component $C_i$:

$$\beta_{i,r} \equiv \prod_j \beta_{i,r}^j \qquad (14)$$

and the terms $\psi_{i,r}^j(z^j)$ are grouped according to the partition of $z = (x, y)$:

$$\psi_{i,r}^x(x) \equiv \prod_o \psi_{i,r}^o(z^o), \qquad \psi_{i,r}^y(y) \equiv \prod_d \psi_{i,r}^d(z^d)$$

The desired posterior $p(y|S) = p(y,S) / p(S)$ can be computed from the joint $p(z,S)$ by
marginalization: along $x$ to obtain $p(y,S)$ and along all $z$ to obtain $p(S)$. Note that each univariate
marginalization of $p(z,S)$ along attribute $z^j$ eliminates all the terms $\psi_{i,r}^j(z^j)$ in the sum (13):

$$p(y,S) = \int_x p(x,y,S)\, dx = \sum_{i,r} P_i\, \pi_r\, \beta_{i,r}\, \psi_{i,r}^y(y)$$

$$p(S) = \int_z p(z,S)\, dz = \sum_{i,r} P_i\, \pi_r\, \beta_{i,r}$$

Therefore, the posterior density can be compactly written as:

$$p(y|S) = \sum_{i,r} \alpha_{i,r}\, \psi_{i,r}^y(y) \qquad (15)$$

where $\alpha_{i,r}$ is the probability that the object $z$ has been generated by component $C_i$ and the
elementary information $s_r$ is true, given the total information $S$:

$$\alpha_{i,r} \equiv P\{C_i, s_r | S\} = \frac{P_i\, \pi_r\, \beta_{i,r}}{\sum_{k,l} P_k\, \pi_l\, \beta_{k,l}} \qquad (16)$$

and $\psi_{i,r}^y(y) = p(y|s_r^y, C_i)$ is the marginal density $p(y|C_i)$ of the desired attributes in the $i$-th
component, modified by the contribution of all the associated $s_r^y$. Since $p(y|s_r, C_i) =
p(y|s_r^y, C_i)$, the expression (15) also follows from the expansion:

$$p(y|S) = \sum_{i,r} p(y|s_r, C_i)\, P\{C_i, s_r | S\}$$

In summary, when the joint density and the likelihood function are approximated by mixture 
models with the proposed structure, the computation of conditional densities given events of arbi- 
trary "geometry" is notably simplified. Factorized components reduce multidimensional integration 
to simple combination of univariate integrals and conjugate families avoid numeric integration. 
This property is illustrated in Fig. 9. 



Ruiz, López-de-Teruel & Garrido




Figure 9. Graphical illustration of the essential property of the MFGN framework. Consider
the MSE estimate for $y$, conditioned to the event that $(y, x^1, x^2)$ is in the cylindrical
region $S$. The required multidimensional integrations are computed analytically in terms
of the marginal likelihoods $\beta_{i,r}^j$ associated to each attribute and each pair of components
$C_i$ and $s_r$ of the models for $p(y, x^1, x^2)$ and for $S$, respectively. In this case $\psi_{i,r}^y(y) = p(y|C_i)$
because no information about $y$ is supplied in $S$.



Example 9: Fig. 10.a shows the joint density of two continuous variables $x$ and $y$. It is modeled
as a mixture with 30 factorized generalized normals. Fig. 10.b shows the likelihood function
of the event $S_1$ = {(x = y OR x = -y) AND y > 0}. Fig. 10.c shows the posterior joint density
$p(x,y|S_1)$. Fig. 10.d shows the likelihood function of the event $S_2$ = {(x,y) = (0,0) OR x = 3}.
Fig. 10.e shows the posterior joint density $p(x,y|S_2)$. Fig. 10.f and 10.g show respectively the
posterior marginal densities $p(x|S_2)$ and $p(y|S_2)$. These complex inferences are analytically
computed under the MFGN framework, without any numeric integration.

Figure 10. Illustrative examples of probabilistic inference from arbitrary uncertain
information in the MFGN framework (see Example 9). Panels (a) to (g).

3.5 Expressions for the Estimators 

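Approximations to the optimum estimators can be easily obtained by taking advantage of the
mathematically convenient structure of the posterior density. Under the MFGN framework, the
conditional expected value of any function $g(y)$ becomes a linear combination of constants:

$$E\{g(y)|S\} = \int g(y)\, p(y|S)\, dy = \sum_{i,r} \alpha_{i,r} \int g(y)\, \psi_{i,r}^y(y)\, dy = \sum_{i,r} \alpha_{i,r}\, E_{i,r}\{g(y)\}$$

where $E_{i,r}\{g(y)\} \equiv E\{g(y)|s_r, C_i\}$ is the expected value of $g(y)$ in the $i$-th component² modified³
by the $r$-th observation $s_r$:

$$E_{i,r}\{g(y)\} \equiv \int g(y) \prod_d \mathcal{N}(z^d, \nu_{i,r}^d, \lambda_{i,r}^d)\, dy$$

We can now analytically compute the desired optimum estimators. For instance, the MSE
estimator for a single continuous attribute $y = z^d$ requires the mean values $E_{i,r}\{z^d\} = \nu_{i,r}^d$:

$$\hat{y}_{MSE}(S) = E\{y|S\} = \sum_{i,r} \alpha_{i,r}\, \nu_{i,r}^d$$

From our explicit expression for $p(y|S)$ we can also compute the conditional cost:

$$e^2_{MSE}(S) \equiv E\{(y - \hat{y}_{MSE}(S))^2 | S\} = E\{y^2|S\} - \hat{y}^2_{MSE}(S) = \sum_{i,r} \alpha_{i,r} \big[ (\nu_{i,r}^d)^2 + (\lambda_{i,r}^d)^2 \big] - \Big( \sum_{i,r} \alpha_{i,r}\, \nu_{i,r}^d \Big)^2$$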


Therefore, given $S$, from the Chebyshev inequality we can answer $y \approx \hat{y}_{MSE}(S) \pm 2 e_{MSE}(S)$
with a confidence level above 75%. When the shape of $p(y|S)$ is complex it must be reported
explicitly (the point estimator $y \approx \hat{y}(S)$ only makes sense if $p(y|S)$ is unimodal).

² Note that computing the conditional expected value of an arbitrary function $g(y)$ of several variables may be difficult. In
general $g(y)$ can be expanded as a power series to obtain $E\{g(y)|S\}$ in terms of moments of $p(y|S)$.

³ When $s_r$ is just $s_r^x$ (there is no information about the target attributes) the constants $E_{i,r}\{g(y)\}$ can be precomputed from
the model of nature $p(z)$ after the learning stage.
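The MSE estimator and its error bar only need the posterior coefficients and the parameters of the modified marginals. The sketch below is illustrative: it assumes the pairs $(i, r)$ have been flattened into three parallel lists, and the function name `mse_estimate` is hypothetical.

```python
import math


def mse_estimate(alpha, nu, lam):
    """Posterior mean and deviation of a mixture of normals.

    alpha : posterior weights alpha_{i,r}, summing to 1
    nu    : means of the modified marginals
    lam   : deviations of the modified marginals
    """
    mean = sum(a * v for a, v in zip(alpha, nu))
    second = sum(a * (v * v + l * l) for a, v, l in zip(alpha, nu, lam))
    variance = max(second - mean * mean, 0.0)   # guard against rounding below zero
    return mean, math.sqrt(variance)


# Two equally weighted terms centred at -1 and +1, each with unit deviation.
y_hat, e = mse_estimate([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
interval = (y_hat - 2.0 * e, y_hat + 2.0 * e)   # the +/- 2e answer of the text
```

The example posterior is bimodal, which is precisely the situation in which the text recommends reporting the density itself rather than the point estimate.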

Example 10: Nonlinear regression. Fig. 11 shows the mixture components and regression
lines (with a confidence band of two standard deviations) obtained in a simple example of
nonlinear dependence between two variables. In this case the joint density can be adequately
approximated by 3 or 4 components: MSE (1 component) = 0.532, MSE (2 comp.) = 0.449,
MSE (3 comp.) = 0.382, MSE (4 comp.) = 0.381.

Figure 11. Nonlinear regression example: (a) 2 components, (b) 4 components.

When the target $y$ is symbolic we must compute the posterior probability of each value. In this
case all the $\lambda_{i,r}^d = 0$ and the $\nu_{i,r}^d = \omega$ are the possible values $\omega$ taken by $y = z^d$. Collecting
together all the $\nu_{i,r}^d = \omega$, as in (8), eq. (15) can be written as:

$$p(y|S) = \sum_\omega \Big[ \sum_{i,r} \alpha_{i,r}^\omega \Big]\, \delta(y - \omega)$$

where $\alpha_{i,r}^\omega$ are the coefficients of the impulses located at $\omega$. The posterior probability of each
value is:

$$q_\omega \equiv P\{y = \omega | S\} = \sum_{i,r} \alpha_{i,r}^\omega$$

For instance, the minimum error probability estimator (EP) is:

$$\hat{y}_{EP}(S) = \arg\max_\omega q_\omega$$

and any desired rejection threshold can be easily established. We can reject the decision if the
entropy of the posterior, $H = -\sum_\omega q_\omega \log q_\omega$, or the estimated error probability, $E = 1 - \max_\omega q_\omega$, are too
high.
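The decision rule with rejection can be sketched in a few lines of Python. The dictionary representation of the posterior, the function name `classify_with_rejection` and the default threshold are assumptions made for illustration only.

```python
import math


def classify_with_rejection(q, max_error=0.1):
    """Minimum error probability decision with an optional rejection.

    q         : dict mapping each symbolic value to its posterior probability
    max_error : reject when the estimated error probability exceeds this value
    Returns (decision or None, entropy, estimated error probability).
    """
    best = max(q, key=q.get)
    error = 1.0 - q[best]
    entropy = -sum(p * math.log(p) for p in q.values() if p > 0.0)
    decision = best if error <= max_error else None
    return decision, entropy, error


accepted = classify_with_rejection({"door": 0.95, "window": 0.05})
rejected = classify_with_rejection({"door": 0.6, "window": 0.4})
```

Rejecting on the entropy instead only requires comparing the second returned value with a threshold.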

Example 11: Nonparametric Pattern Recognition. Fig. 12 shows a bivariate data set with
elements from two different categories, represented as the value of an additional symbolic
attribute. The joint density can be satisfactorily approximated by a 6-component mixture (Fig.
12.a). The decision regions when the rejection threshold was set to 0.9 are shown in Fig. 12.b.
Note that Statistical Pattern Classification usually starts from an (implicit or explicit)
approximation to the class-conditional densities. In contrast, we start from the joint density, from
which the class-conditional densities can be easily derived (Fig. 12.c).

Figure 12. Simple nonparametric 2-feature pattern recognition task and its 3-attribute
joint mixture model: (a) Feature space and mixture components, (b) Decision boundary,
(c) One of the class-conditional densities.

The computation of the optimum estimators for other loss functions is straightforward. Observe
that the estimators are based on the combination of different rules, weighted by their degree
of applicability. This is a typical structure used by many other decision methods. In our case, since
the components of the joint density have independent variables, the rules reduce to constants, the
simplest type of rule.

3.6 Examples of Elementary Pieces of Information 

Some important types of elementary observations $s_r^j$ about $z^j$ are shown, including the
corresponding likelihoods $\beta_{i,r}^j$ and modified marginals $\psi_{i,r}^j$ ($j = d$) required in expression (15).

Exact information: $s_r^j = z^j$. The observation is modeled by an impulse: $p(s_r^j|z^j) =
\mathcal{N}(s_r^j, z^j) = \delta(s_r^j - z^j)$. Therefore:

$$\beta_{i,r}^j = \mathcal{N}(s_r^j, \mu_i^j, \sigma_i^j)$$

The contribution $\beta_{i,r}^o$ of exact information about the input attribute $z^o$ is the standard
likelihood $p(z^o|C_i)$ of the observed value $z^o$ in each component. On the other hand, if we acquire
exact information about a target attribute $z^d$ (when there is only one ($R = 1$) elementary observation
and $s^d = z^d$) then the inference process is trivially not required: $p(z^d|S) = \delta(z^d - s^d)$.

Gaussian noise with bias $\eta_r^j$ and standard deviation $\varepsilon_r^j$: The observation is modeled by a
1-component mixture: $p(s_r^j|z^j) = \mathcal{N}(s_r^j, z^j + \eta_r^j, \varepsilon_r^j)$, which can also be expressed as a 95%
confidence interval $z^j \approx s_r^j - \eta_r^j \pm 2\varepsilon_r^j$. From property (2-2):

$$\beta_{i,r}^j = \mathcal{N}\Big(s_r^j - \eta_r^j,\ \mu_i^j,\ \sqrt{(\sigma_i^j)^2 + (\varepsilon_r^j)^2}\Big)$$

The effect of a noisy input $s_r^j - \eta_r^j \pm 2\varepsilon_r^j$ is equivalent to the effect of an exact input $s_r^j - \eta_r^j$
in a mixture with components of larger variance: $\sigma_i^j \rightarrow \sqrt{(\sigma_i^j)^2 + (\varepsilon_r^j)^2}$. Uncertainty spreads the
effect of the observation, increasing the contribution of distant components.

Example 12: Fig. 13.a shows a simple two-attribute domain approximated by a 3-component
mixture. We are interested in the marginal density of attribute $x$ given different degrees of
uncertainty $\varepsilon$ in the input attribute $y \approx .4 \pm 2\varepsilon$, modeled by $p(s^y|y) = \mathcal{N}(s^y, .4, \varepsilon)$. If $\varepsilon = 0$
we have the sharpest density (A) in Fig. 13.b, providing .4 ± .5. If $\varepsilon = .25$ we obtain density
(B) and $x \approx -.3 \pm .7$. Finally, if $\varepsilon = .5$ we obtain density (C) and $x \approx -.2 \pm .8$. Obviously, as the
uncertainty in $y$ increases, so does the uncertainty in $x$. The expected value of $x$ moves towards
more distant components, which become more likely as the probability distribution of $y$
expands. In this situation an interesting effect appears: the mode of the marginal density does not
change at the same rate as the mean. Uncertainty in $y$ skews $p(x)$. This effect suggests that
the optimum estimators for different loss functions are not equally robust against uncertainty.




Figure 13. Effect of the amount of uncertainty (see text): (a) Data set and 3-component
model, (b) $p(x\,|\,\text{uncertain } y\text{'s around } 0.4)$.

For the output role, $\psi_{i,r}^d(z^d)$ becomes the original marginal, modified in location and dispersion
towards $s_r^d$ according to the factor $\gamma \equiv (\sigma_i^d)^2 / [(\sigma_i^d)^2 + (\varepsilon_r^d)^2]$, which quantifies the relative
importance of the observation:

$$\psi_{i,r}^d(z^d) = \mathcal{N}\big(z^d,\ \gamma (s_r^d - \eta_r^d) + (1 - \gamma) \mu_i^d,\ \gamma^{1/2} \varepsilon_r^d\big)$$

Missing data. When there is no information about the $j$-th attribute, $s_r^j = \{z^j = ?\}$, the
observation can be modeled by $p(s_r^j|z^j)$ = constant or, equivalently, $p(s_r^j|z^j) = \mathcal{N}(s_r^j, a, b)$ with $a$
arbitrary and $b \rightarrow \infty$. All the components contribute with the same weight:

$$\beta_{i,r}^j = p(z^j = \text{anything}\,|\,C_i) \propto \text{constant} \equiv 1$$

If the target is missing the $\psi_{i,r}^d(z^d)$ reduce to the original marginal components:

$$\psi_{i,r}^d(z^d) = p(z^d|C_i) = \mathcal{N}(z^d, \mu_i^d, \sigma_i^d)$$

Arbitrary uncertainty. In general, any unidimensional relative likelihood function can be
approximated by a mixture of generalized normals, as shown in Example 6, where $\beta_{i,r}^j$ and $\psi_{i,r}^j(z^j)$
are given respectively by eqs. (11) and (12).

Intervals. Some useful functions cannot be accurately approximated by a small number of normal
components. A typical example is the indicator function of an interval, used to model an uncertain
observation where all the values are equally likely: $s_r^j = \{z^j \in (a,b)\}$. If $z^j$ is only considered as
input, we can use the shortcut $\beta_{i,r}^j = F_i^j(b) - F_i^j(a)$, where $F_i^j(z^j)$ is the cumulative
distribution of the normal marginal component $p(z^j|C_i)$. Unfortunately, the expression for $\psi_{i,r}^j(z^j)$,
required for $z^j$ considered as output, may not be so useful for computing certain optimum
estimators: $\psi_{i,r}^j(z^j)$ is the restriction of $p(z^j|C_i)$ to the interval $(a,b)$, normalized by $\beta_{i,r}^j$.
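The interval shortcut only needs the normal cumulative distribution, which can be written with the error function from the standard library. This is a minimal sketch assuming a continuous attribute with $\sigma_i^j > 0$; the names `normal_cdf` and `interval_likelihood` are illustrative.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution of a normal with sigma > 0."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_likelihood(a, b, mu, sigma):
    """beta for the observation {z in (a, b)} in a component N(z, mu, sigma)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


# Probability mass of a standard normal component inside (-1, 1).
beta = interval_likelihood(-1.0, 1.0, 0.0, 1.0)
```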

Disjunction and conjunction of events. Finally, standard probability rules are used to build
structured information from simple observations: if from subjective judgments or objective
evidence we ascribe relative degrees of credibility $\theta_r$ to several observations $s_r^j$ about $z^j$, the overall
likelihood becomes $\beta_i^j = \sum_r \theta_r \beta_{i,r}^j$. In particular, if $s^j = \{z^j = \omega_1 \text{ OR } z^j = \omega_2\}$ and the two
possibilities are equiprobable then $\beta_i^j = p(\omega_1|C_i) + p(\omega_2|C_i)$. Analogously, conjunctions of
events translate to multiplication of likelihood functions.

3.7 Summary of the Inference Procedure

Once the domain $p(z)$ has been adequately modeled in the learning process (as explained in Section
4), the system enters the inference stage over new, partially specified objects. From the parameters
of the domain $p(z)$ ($P_i$, $\mu_i^j$ and $\sigma_i^j$) and the parameters of the model of the observation $p(S|z)$
($\pi_r$, $s_r^j$ and $\varepsilon_r^j$), we must obtain the parameters $\beta_{i,r}^j$, $\nu_{i,r}^j$ and $\lambda_{i,r}^j$ of the desired marginal
posterior densities and estimators. The inference procedure comprises the following steps:

■ Compute the elementary likelihoods $\beta_{i,r}^j$, using eq. (11).

■ Obtain the product $\beta_{i,r}$ for each conjunction $s_r$ and component $C_i$, using eq. (14).

■ Normalize $P_i\, \pi_r\, \beta_{i,r}$ to obtain the coefficients $\alpha_{i,r}$ of the posterior, using eq. (16).

■ Choose the desired target attributes $y = \{z^d\}$ and compute the parameters $\nu_{i,r}^d$ and
$\lambda_{i,r}^d$ of the modified component marginal densities $\psi_{i,r}^d(z^d)$ using eq. (12).

■ Report the joint posterior density of $y$. Show graphs of the posterior marginal densities
of the desired attributes $z^d$ using eq. (15). Provide optimum (point, interval, etc.)
estimators using eq. (17).
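The steps above can be sketched in Python for the simplest setting. The sketch assumes continuous attributes only (all $\sigma_i^j > 0$), unbiased observations, and a missing value encoded as `None`; the data layout and the names `elementary` and `infer` are illustrative choices, not part of the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def elementary(s, eps, mu, sigma):
    """(beta, nu, lam) for one attribute; s is None encodes a missing value."""
    if s is None:
        return 1.0, mu, sigma                  # beta = 1, marginal unchanged
    rho = math.hypot(sigma, eps)
    beta = normal_pdf(s, mu, rho)              # eq. (11)
    nu = (sigma**2 * s + eps**2 * mu) / rho**2
    lam = sigma * eps / rho                    # eq. (12)
    return beta, nu, lam


def infer(P, mu, sigma, pi, s, eps, target):
    """Posterior of attribute `target` as a list of (alpha, nu, lam) triples.

    P[i], mu[i][j], sigma[i][j] : mixture model of the domain
    pi[r], s[r][j], eps[r][j]   : mixture model of the observation
    """
    terms = []
    for i, P_i in enumerate(P):
        for r, pi_r in enumerate(pi):
            weight = P_i * pi_r
            nu_t, lam_t = mu[i][target], sigma[i][target]
            for j in range(len(mu[i])):
                beta, nu, lam = elementary(s[r][j], eps[r][j], mu[i][j], sigma[i][j])
                weight *= beta                 # product of eq. (14)
                if j == target:
                    nu_t, lam_t = nu, lam
            terms.append((weight, nu_t, lam_t))
    total = sum(w for w, _, _ in terms)        # p(S)
    return [(w / total, nu, lam) for w, nu, lam in terms]   # eq. (16)


# Two components and two attributes (x, y); x is observed exactly, y is missing.
P = [0.5, 0.5]
mu = [[-2.0, -2.0], [2.0, 2.0]]
sigma = [[1.0, 1.0], [1.0, 1.0]]
posterior = infer(P, mu, sigma, pi=[1.0], s=[[2.0, None]], eps=[[0.0, None]], target=1)
y_hat = sum(a * nu for a, nu, _ in posterior)  # MSE estimate of y
```

In the usage example almost all the posterior weight goes to the second component, so the estimate of $y$ stays close to that component's mean.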

Example 13: Iris Data. The inference procedure is illustrated over the well known Iris
benchmark: 150 objects represented by four numeric features ($x$, $y$, $z$ and $w$) and one symbolic
category $U \in \{U_1 \text{ (setosa)}, U_2 \text{ (versicolor)}, U_3 \text{ (virginica)}\}$. The whole data set was divided into
two disjoint subsets for training and validation. The joint density can be satisfactorily
approximated (see Section 4) by a 6-component mixture (the error rate classifying $U$ in the validation
set without rejection was 2.67%). Fig. 14 shows two projections of the 150 examples and the
location of the mixture components learned from the training subset. The parameters of the
mixture are shown in Table 3.

Table 3. Parameters of the Iris Data Joint Density Model



Table 4 shows the results of the inference process in the following illustrative situations:

Case 1: Attribute $z$ is known: S = {z = 5}.

Case 2: Attributes $x$ and $U$ are known: S = {(x = 5.5) AND (U = $U_2$)}.

Case 3: Attribute $x$ is uncertain: S = {x ≈ 7 ± 1}.

Case 4: Attributes $x$ and $w$ are uncertain: S = {(x ≈ 7 ± 1) AND (w ≈ 1 ± 0.5)}. Note that
uncertainty decreases when more information is supplied (compare with Case 3).

Case 5: Structured query expressed in terms of logical connectives over uncertain elementary
events: S = {[(z ≈ 1 ± 3) OR (z ≈ 7 ± 3)] AND [(U = $U_1$) OR (U = $U_2$)]}.

Table 4. Some Inference Results Over the IRIS Domain



The consistency of the results can be visually checked in Fig. 14. Finally, Table 5 shows the
elementary likelihoods $\beta_{i,r}^j$ of Case 5, illustrating the essence of the method.

Table 5. Elementary likelihoods in Case 5 from Table 4.



3.8 Independent Measurements 

One of the key features of the MFGN framework is the ability to infer over arbitrary relational
knowledge about the attributes, in the form of a likelihood function adequately approximated by a
mixture model with the structure of eq. (9). For instance, we could answer questions such as: "what
happens to $z^d$ when $z^1$ tends to be less than $z^2$?" (i.e., when $p(S|z)$ is high in the region $z^1 - z^2 < 0$).
However, there are situations where the observations over each single attribute $z^j$ are statistically
independent: we have information about attributes (e.g. $z^1$ is around a and $z^2$ is around b) but not
about attribute relations. We will pay attention to this particular case because it illustrates the role
of the main MFGN framework elements. Furthermore, many practical applications can be
satisfactorily solved under the assumption of independent measurements or judgments. In this case, the
likelihood of the available information can be expressed as the conjunction of $n$ "marginal"
observations about $z^j$:

$$p(S|z) = \prod_j p(s^j|z^j) \qquad (18)$$

This means that the sum of products in equation (9) is "complete", i.e., it includes all the elements
in the $n$-fold cartesian product of attributes:

$$p(S|z) = \prod_j \sum_r \pi_r^j\, \mathcal{N}(z^j, s_r^j, \varepsilon_r^j)$$

where the weight of each product term in (9) is the product of the corresponding marginal weights
$\pi_r^j$. This factored likelihood function can be considered also as a 1-component
mixture (with $R = 1$ in (9) and $\pi_1 = 1$) where the marginal observation models are allowed to be
mixtures of generalized normals: $p(s^j|z^j) = \sum_r \pi_r^j\, \mathcal{N}(z^j, s_r^j, \varepsilon_r^j)$. In this case we can even think
of "function valued" attributes $z = (f^1(z^1), \ldots, f^n(z^n))$, where $f^j(z^j) \equiv p(s^j|z^j)$ models the
range and relative likelihood of the values of $z^j$. Loosely speaking, attributes with concentrated
$f^j(z^j)$ may be considered as inputs, and attributes with high dispersion play the role of outputs.

Since $y$ is conditionally independent of $s^x$ given $C_i$, the posterior can be obtained from the
expansion:

$$p(y|S) = \sum_i p(y|S, C_i)\, P\{C_i|S\} = \sum_i p(y|C_i, s^y)\, P\{C_i|S\} \qquad (19)$$

The interpretation of (19) is straightforward. The effect of $s^x$ over $y = \{z^d\}$ must be computed
through $x = \{z^o\}$ and the components $C_i$. Then, a simple Bayesian update of $p(y|s^x)$ as a new
prior is made using $s^y$ (see Fig. 15).




Figure 15. Structure of the MFGN inference process from independent pieces of
information. In this case, the likelihood function is also factorizable. The data flow in the
inference process is shown by dotted arrows.

4. Learning from Uncertain Information

In the previous section, we have described the inference process from uncertain information under
the MFGN framework. Now we will develop a learning algorithm for the model of the domain,
where the training examples will be also uncertain. Specifically, we must find the parameters $P_i$,
$\mu_i^j$, $\sigma_i^j$ (or $t_{i,\omega}^j$) of a mixture with structure (7) to approximate the true joint density $p(z)$ from a
training i.i.d. random sample $\{z^{(k)}\}$, $k = 1..M$, partially known through the associated likelihood
functions $\{S^{(k)}\}$ with structure (9).

4.1 Overview of the EM Algorithm 

Maximum Likelihood estimates for the parameters of mixture models are usually computed by the
well-known Expectation-Maximization (EM) algorithm (Dempster, Laird and Rubin 1977, Redner
and Walker 1984, Tanner 1996), based on the following idea. In principle, the maximization of the
training sample likelihood $J = \prod_k p(z^{(k)})$ is a mathematically complex task due to the product of
sums structure. However, note that $J$ could be conveniently expressed for maximization if the
components that generated each example were known (this is called "complete data" in EM
terminology). The underlying credit assignment problem disappears and the estimation task reduces to
several uncoupled simple maximizations. The key idea of EM is the following: instead of maximizing
the complete data likelihood (which is unknown), we can iteratively maximize its expected value
given the training sample and the current mixture parameters. It can be shown that this process
eventually achieves a local maximum of $J$.

Instead of a rigorous derivation of the EM algorithm, to be found in the references (see
especially McLachlan and Krishnan, 1997), we will present here a more heuristic justification which
provides insight for generalizing the EM algorithm to accept uncertain examples. We will review
first the simplest case, where no missing or uncertain values are allowed in the training set. The
parameters of the mixture are conditional expectations:

$$E_{z|C_i}\{g(z)|C_i\} = \int_z g(z)\, p(z|C_i)\, dz \qquad (20)$$

In particular, $\mu_i^j = E\{z^j|C_i\}$, $(\sigma_i^j)^2 = E\{(z^j - \mu_i^j)^2|C_i\}$ and $t_{i,\omega}^j = E\{I\{z^j = \omega\}|C_i\}$. The
mixture proportions are $P_i = E\{P\{C_i|z\}\}$.

We rewrite the conditional expectation (20) using Bayes Theorem in the form of an
unconditional expectation:

$$E_{z|C_i}\{g(z)|C_i\} = \int_z g(z)\, P\{C_i|z\}\, p(z)\, /\, P\{C_i\}\, dz \qquad (21)$$

$$= E_z\{g(z)\, P\{C_i|z\}\}\, /\, P_i \qquad (22)$$

The EM algorithm can be interpreted as a method to iteratively update the mixture parameters
using expression (22) in the form of an empirical average over the training data⁴. Starting from a
tentative, randomly chosen set of parameters, the following E and M steps are repeated until the
total likelihood $J$ no longer improves (the notation $\{expression\}^{(k)}$ means that (expression) is
computed with the parameters of example $z^{(k)}$):

(E) Expectation step. Compute the probabilities $q_i^{(k)} \equiv P\{C_i|z^{(k)}\}$ that the $k$-th example has been
generated by the $i$-th component of the mixture:

$$q_i^{(k)} = \frac{P_i\, p(z^{(k)}|C_i)}{\sum_l P_l\, p(z^{(k)}|C_l)}$$

(M) Maximization step. Update the parameters of each component using all the examples, weighted
by their probabilities $q_i^{(k)}$. First, the a priori probabilities of each component:

$$P_i \leftarrow \frac{1}{M} \sum_k q_i^{(k)}$$

Then, for continuous variables, the mean values and standard deviations in each component:

$$\mu_i^j \leftarrow \frac{1}{M P_i} \sum_k q_i^{(k)}\, z^{j(k)}, \qquad (\sigma_i^j)^2 \leftarrow \frac{1}{M P_i} \sum_k q_i^{(k)} \big(z^{j(k)} - \mu_i^j\big)^2 \qquad (23)$$

and for symbolic variables, the probabilities of each value:

$$t_{i,\omega}^j \leftarrow \frac{1}{M P_i} \sum_k q_i^{(k)}\, I\{z^{j(k)} = \omega\}$$

⁴ Expression (21) can be also used for iterative approximation of explicit functions which are not indirectly known by
i.i.d. sampling (e.g., subjective likelihood functions sketched by the human user, as in Example 4). In this case $p(z)$ is set
to the target function and $P\{C_i|z\}$ is computed from the current mixture model.
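The E and M steps for exact data can be sketched as follows. This is a minimal illustration restricted to continuous attributes, with a fixed number of iterations instead of a likelihood-based stopping rule; the function name `em_step` and the toy data are assumptions, not taken from the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_step(data, P, mu, sigma):
    """One E+M iteration for a mixture of factorized Gaussians.

    data         : list of examples, each a list of n continuous attributes
    P, mu, sigma : current parameters, indexed as mu[i][j] and sigma[i][j]
    """
    M, n, K = len(data), len(data[0]), len(P)
    # E step: q[k][i] = P{C_i | z^(k)}
    q = []
    for z in data:
        joint = [P[i] * math.prod(normal_pdf(z[j], mu[i][j], sigma[i][j])
                                  for j in range(n)) for i in range(K)]
        total = sum(joint)
        q.append([w / total for w in joint])
    # M step: weighted empirical averages, as in eq. (23)
    new_P, new_mu, new_sigma = [], [], []
    for i in range(K):
        mass = sum(q[k][i] for k in range(M))          # equals M * P_i
        m = [sum(q[k][i] * data[k][j] for k in range(M)) / mass for j in range(n)]
        v = [sum(q[k][i] * (data[k][j] - m[j]) ** 2 for k in range(M)) / mass
             for j in range(n)]
        new_P.append(mass / M)
        new_mu.append(m)
        new_sigma.append([math.sqrt(x) for x in v])
    return new_P, new_mu, new_sigma


# Toy univariate sample with two well separated groups.
data = [[-2.1], [-1.9], [-2.0], [2.0], [1.9], [2.1]]
P, mu, sigma = [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]
for _ in range(20):
    P, mu, sigma = em_step(data, P, mu, sigma)
```

A practical implementation would also guard against components whose deviation collapses to zero, which this sketch does not attempt.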
4.2 Extension to Uncertain Values 

In general, in the MFGN framework we do not know the true values of the attributes in the
training examples, required to compute $g(z)\, P\{C_i|z\}$ in the (empirical) expectation (22). Instead,
we will start from uncertain observations $S^{(k)}$ about the true training examples $z^{(k)}$, in the form of
likelihood functions expressed as mixtures of generalized normals:

$$p(S^{(k)}|z^{(k)}) = \sum_r \pi_r^{(k)} \prod_j \mathcal{N}\big(z^{j(k)}, s_r^{j(k)}, \varepsilon_r^{j(k)}\big)$$

Therefore, we must express the expectation (22) over $p(z)$ as an unconditional expectation
over $p(S)$, the distribution which generates the available information about the training set. This
can be easily done by expanding $p(z|C_i)$ in terms of $S$:

$$E_{z|C_i}\{g(z)|C_i\} = \int_z g(z)\, p(z|C_i)\, dz$$

$$= \int_z g(z) \Big[ \int_S p(z|S, C_i)\, p(S|C_i)\, dS \Big] dz$$

$$= \int_S \Big[ \int_z g(z)\, p(z|S, C_i)\, dz \Big] P\{C_i|S\}\, p(S)\, dS\, /\, P\{C_i\} \qquad (24)$$



If we define

$$\Gamma_i(S) \equiv E_{z|S,C_i}\{g(z)|S, C_i\} = \int_z g(z)\, p(z|C_i)\, p(S|z)\, dz\, /\, p(S|C_i)$$

then the parameters of $p(z)$ can be finally written⁵ as an unconditional expectation over the
observable $p(S)$ in a form similar to eq. (22):

$$E_{z|C_i}\{g(z)|C_i\} = E_S\{\Gamma_i(S)\, P\{C_i|S\}\}\, /\, P_i \qquad (25)$$

This expression justifies an extended form of the EM algorithm to iteratively update the
parameters of $p(z)$ by averaging $\Gamma_i(S)\, P\{C_i|S\}$ over the available training information $\{S^{(k)}\}$
drawn from $p(S)$. This can be considered as a numerical/statistical method for solving $p(z)$ in the
integral equation:

$$\int_z p(S|z)\, p(z)\, dz = p(S)$$

Note that we cannot approximate $p(S)$ as a fixed mixture in terms of $p(S|C_i)$ and then
compute back the corresponding $p(z|C_i)$ because, in general, $p(S|z)$ will be different for the
different training examples. For the same reason, elementary deconvolution methods are not directly
applicable.

This kind of problem is addressed by Vapnik (1982, 1995) to perform inference from the result of 
"indirect measurements". This is an ill-posed problem, requiring regularization techniques. The 
proposed extended EM algorithm can be considered as a method for empirical regularization, in 
which the solution is restricted to be in the family of mixtures of (generalized) gaussians. EM is 
also proposed by You and Kaveh (1996) for regularization in the context of image restoration. 

The interpretation of (25) is straightforward. Since we do not know the exact $z$ required to
approximate the parameters of $p(z)$ by empirically averaging $g(z)\, P\{C_i|z\}$, we obtain the same
result by averaging the corresponding $\Gamma_i(S)\, P\{C_i|S\}$ in the $S$ domain, where $\Gamma_i(S)$ plays the
role of $g(z)$ in (22). As $z$ is uncertain, $g(z)$ is replaced by its expected value in each component
given the information about $S$. In particular, if there is exact knowledge about the training set
attributes ($S^{(k)} = z^{(k)}$, i.e., $R = 1$ and the marginal likelihoods are impulses) then (25) reduces to
(22). Fig. 16 illustrates the approximation process performed by the extended version of the EM
algorithm in a simple univariate situation.

It is convenient to develop a version of the proposed Extended EM algorithm for uncertain
training sets, structured as tables of (sub)cases × (uncertainly valued) variables (see Fig. 17).
First, let us write eq. (24) expanding $S$ in terms of its components $s_r$:

$$p(z|C_i, S)\, P\{C_i|S\} = p(z, C_i|S)$$

$$= \sum_r p(z, C_i|s_r)\, P\{s_r|S\} = \sum_r p(z|C_i, s_r)\, P\{C_i|s_r\}\, P\{s_r|S\}$$

Therefore

$$\Gamma_i(S)\, P\{C_i|S\} = \sum_r \Gamma_i(s_r)\, P\{C_i, s_r|S\}$$



⁵ This result can be also obtained from the relation $E_z\{w(z)\} = E_S\{E_{z|S}\{w(z)|S\}\}$ for $w(z) \equiv g(z)\, P\{C_i|z\}$ and Bayes
Theorem.

Figure 16. The extended EM algorithm iteratively reduces the (large) difference between
(a) the true density $p(z)$, and (b) the mixture model $\hat{p}(z)$, indirectly through the (small)
discrepancies between (c) the true observation density $p(S)$ and (d) the modeled
observation density $\hat{p}(S)$. In real cases $p(S)$ must be estimated from a finite i.i.d. sample.

Figure 17. Structure of the uncertain training information for the Extended EM
Algorithm. The coefficients $\pi_r^{(k)}$ are normalized for easy detection of the rows included in
each uncertain example. When $z^{(k)}$ is not uncertain, $S^{(k)}$ reduces to a single row with
$\pi^{(k)} = 1$ and all the $\varepsilon^j = 0$.



Using the notation introduced in (12),

$$p(z|C_i, s_r) = \prod_j \psi_{i,r}^j(z^j)$$

we can write (25) as:

$$E_{z|C_i}\{g(z)|C_i\} = E_S\Big\{ \sum_r \alpha_{i,r} \int_z g(z) \prod_j \psi_{i,r}^j(z^j)\, dz \Big\}\, /\, P_i$$

In the MFGN framework the contributions $\Gamma_i(s_r)\, P\{C_i, s_r|S\}$ to the empirical expected
values required by the Extended EM algorithm can be obtained again without numeric integration.
We only need to consider the case $g(z) = z^j$ to compute the means $\mu_i^j$ and probabilities $t_{i,\omega}^j$, and
$g(z) = (z^j)^2$ for the deviations $\sigma_i^j$. From (12) we already know an explicit expression for the
parameters of $\psi_{i,r}^j(z^j) = \mathcal{N}(z^j, \nu_{i,r}^j, \lambda_{i,r}^j)$. Hence:

$$\int z^j\, \psi_{i,r}^j(z^j)\, dz^j = \nu_{i,r}^j, \qquad \int (z^j)^2\, \psi_{i,r}^j(z^j)\, dz^j = (\nu_{i,r}^j)^2 + (\lambda_{i,r}^j)^2$$

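In conclusion, the steps of the Extended EM algorithm are as follows:

(E) Expectation step. Compute all the elementary likelihoods of the training set:

$$\beta_{i,r}^{j(k)} = \mathcal{N}\Big(s_r^{j(k)},\ \mu_i^j,\ \big[(\sigma_i^j)^2 + (\varepsilon_r^{j(k)})^2\big]^{1/2}\Big) \qquad (26)$$

Obtain the likelihood of each conjunction $s_r^{(k)}$ of example $S^{(k)}$ in component $C_i$:

$$\beta_{i,r}^{(k)} = \prod_j \beta_{i,r}^{j(k)}$$

Obtain the total likelihood of example $S^{(k)}$:

$$p(S^{(k)}) = \sum_i \sum_r P_i\, \pi_r^{(k)}\, \beta_{i,r}^{(k)}$$

Compute the probabilities $q_{i,r}^{(k)} \equiv P\{C_i, s_r^{(k)}|S^{(k)}\}$ that the $r$-th component of the $k$-th
example has been generated by the $i$-th component of the mixture:

$$q_{i,r}^{(k)} = \frac{P_i\, \pi_r^{(k)}\, \beta_{i,r}^{(k)}}{p(S^{(k)})}$$

(M) Maximization step. Update the parameters of each component $C_i$ using all the components
$s_r^{(k)}$ of all the examples, weighted with their probabilities $q_{i,r}^{(k)}$. First, the prior probabilities of each
component:

$$P_i \leftarrow \frac{1}{M} \sum_k \sum_r q_{i,r}^{(k)}$$

Then the mean value and standard deviation in each component:

$$\mu_i^j \leftarrow \frac{1}{M P_i} \sum_k \sum_r q_{i,r}^{(k)}\, \nu_{i,r}^{j(k)}, \qquad (\sigma_i^j)^2 \leftarrow \frac{1}{M P_i} \sum_k \sum_r q_{i,r}^{(k)} \Big[ (\nu_{i,r}^{j(k)})^2 + (\lambda_{i,r}^{j(k)})^2 \Big] - (\mu_i^j)^2 \qquad (27)$$

For symbolic variables under representation (8) we may use:

$$\beta_{i,r}^{j(k)} = \sum_\omega P\{s_r^{j(k)} = \omega\}\, t_{i,\omega}^j \qquad (26)'$$

$$t_{i,\omega}^j \leftarrow \frac{1}{M P_i} \sum_k \sum_r q_{i,r}^{(k)}\, P\{s_r^{j(k)} = \omega\}\, t_{i,\omega}^j\, /\, \beta_{i,r}^{j(k)} \qquad (27)'$$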

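Consider the particular case in which the attributes in the training examples are contaminated
with unbiased Gaussian noise. The likelihood of the uncertain observations is modeled by
1-component mixtures: $p(s^{(k)}|z^{(k)}) = \prod_j \mathcal{N}(z^{j(k)}, s^{j(k)}, \varepsilon^{j(k)})$, where $(\varepsilon^{j(k)})^2$ is the variance of
the measurement process over $z^{j(k)}$ which obtains the observed value $s^{j(k)}$. This can be also
expressed as a confidence interval $z^{j(k)} \approx s^{j(k)} \pm 2\varepsilon^{j(k)}$. In this case, the basic EM algorithm (23)
can be easily modified to take into account the effect of the uncertainties $\varepsilon^{j(k)}$. In the E step,
compute $q_i^{(k)}$ using the following deviations:

$$\sigma_i^j \rightarrow \sqrt{(\sigma_i^j)^2 + (\varepsilon^{j(k)})^2}$$

and, in the M step, apply the substitution:

$$z^{j(k)} \rightarrow \gamma\, s^{j(k)} + (1 - \gamma)\, \mu_i^j, \qquad (z^{j(k)})^2 \rightarrow \big[ \gamma\, s^{j(k)} + (1 - \gamma)\, \mu_i^j \big]^2 + \gamma\, (\varepsilon^{j(k)})^2$$

where

$$\gamma \equiv \frac{(\sigma_i^j)^2}{(\sigma_i^j)^2 + (\varepsilon^{j(k)})^2}$$

measures the relative importance of the observed $s^{j(k)}$ for computing the new $\mu_i^j$ and $\sigma_i^j$.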

The previous situation illustrates how missing values must be processed in the learning stage.
If $z^{j(k)}$ is exact then $\varepsilon^{j(k)} = 0$ and $\gamma = 1$, so the original algorithm (23) is not changed. In the other
extreme, if $z^{j(k)}$ is missing, which can be modeled by $\varepsilon^{j(k)} = \infty$, we get $\gamma = 0$ and therefore the
observation $s^{j(k)}$ does not contribute to the new parameters at all. The correct procedure to deal
with missing values in the MFGN framework is simply omitting them in the empirical averages.
Note that this fact arises from the factorized structure of the mixture components, providing
conditionally independent attributes. Alternative learning methods require a careful management of
missing data to avoid biased estimators (Ghahramani & Jordan 1994).
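The two limiting cases can be checked numerically with a small sketch of the M-step substitution for Gaussian noise. The function name `expected_moments` is hypothetical, and the variance term is written as $(1 - \gamma)\sigma^2$, which is algebraically equal to $\gamma \varepsilon^2$ but stays finite when $\varepsilon = \infty$.

```python
def expected_moments(s, eps, mu, sigma):
    """Replacement values for z and z^2 in the M step under Gaussian noise.

    s, eps    : observed value and deviation of its measurement noise
                (eps may be 0 for exact data or float('inf') for missing data)
    mu, sigma : current mean and deviation of the component marginal, sigma > 0
    Returns (E[z], E[z^2]) given the observation and the component.
    """
    gamma = sigma**2 / (sigma**2 + eps**2)   # relative importance of s
    mean = gamma * s + (1.0 - gamma) * mu
    var = (1.0 - gamma) * sigma**2           # same value as gamma * eps**2
    return mean, mean**2 + var


exact = expected_moments(3.0, 0.0, 1.0, 2.0)             # gamma = 1
noisy = expected_moments(3.0, 2.0, 1.0, 2.0)             # gamma = 0.5
missing = expected_moments(3.0, float("inf"), 1.0, 2.0)  # gamma = 0
```

For the missing case the function returns the component's own current moments, so the observed value carries no information of its own.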

4.3 Evaluation of the Extended EM Algorithm 

We have studied the improvement of the parameter estimations when the uncertainty of the
observations, modeled by likelihood functions, is explicitly taken into account. The proposed Extended
EM is compared with the EM algorithm over the "raw" observations (Basic EM), which ignores the
likelihood function and typically uses just its average value (e.g., given x = 8 ± 2, Basic EM uses
x = 8). We considered a synthetic 3-attribute domain with the following joint density:

$$p(x,y,w) = 0.5\, \mathcal{N}(x,0,2)\, \mathcal{N}(y,0,1)\, \mathcal{N}(w,\text{white}) + 0.5\, \mathcal{N}(x,2,1)\, \mathcal{N}(y,2,2)\, \mathcal{N}(w,\text{black})$$

Different learning experiments were performed with varying degrees of uncertainty. In all 
cases the training sample size was 250. All trained models had the same structure as the true den- 
sity (2 components), since the goal of this experiment is to measure the quality of the estimation 
with respect to the amount of uncertainty, without regard of other sources of variability such as 
local minima, alternative solutions, etc., which are empirically studied in Section 5. Table 6 shows 
the mixture parameters obtained by the learning algorithms. Fig. 18 graphically shows the differ- 
ence between Extended and Basic EM in some illustrative cases. 

Case 0: Exact Data (Fig. 18.a). 

Cases m%: Results of the Extended EM learning algorithm when there is an m% rate of missing
values in the training data.

Case 1: Basic EM when attribute y is biased +3 units with probability 0.7. Case 2: Extended EM
algorithm over Case 1 (see Fig. 18.b). Here, the observed value is $s_y = y+3$ in 70% of the samples
and $s_y = y$ in the rest. In all samples, Basic EM uses the observed value $s_y$ and Extended EM uses
the explicit likelihood function $l(y) = 0.3\,\delta(y - s_y) + 0.7\,\delta(y - (s_y - 3))$.

Case 3: Basic EM when attributes x and y have Gaussian noise with σ = 0.5 and w is changed with
probability 0.1. Case 4: Extended EM algorithm over Case 3.

Case 5: Basic EM when x and y have Gaussian noise with σ = 1 and w is changed with probability
0.2. Case 6: Extended EM algorithm over Case 5 (see Fig. 18.c).

Case 7: Basic EM when x and y have Gaussian noise with σ = 2 and w is changed with probability
0.3. Case 8: Extended EM algorithm over Case 7 (see Fig. 18.d).

Case 9: Extended EM when values y > 3 are missing (censoring). Case 10: Extended EM over Case
9 when the missing y values are assumed to be distributed as $\mathcal{T}(y, 4, 1)$, providing some
additional information on the data generation mechanism.
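
To make the likelihood of Case 2 concrete, the following sketch evaluates how a single Gaussian component weighs the two candidate true values of y (the observed value, or the observed value minus the bias). It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical and only one continuous attribute of one component is considered.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def biased_observation_posterior(s_y, mu, sigma, p_bias=0.7, bias=3.0):
    """Combine l(y) = (1-p_bias)*delta(y-s_y) + p_bias*delta(y-(s_y-bias))
    with a component N(mu, sigma).

    Returns the evidence of the observation under the component, the
    posterior probability of each candidate value and the expected y.
    """
    candidates = [(1.0 - p_bias, s_y), (p_bias, s_y - bias)]
    weights = [w * normal_pdf(y, mu, sigma) for w, y in candidates]
    evidence = sum(weights)  # assumed nonzero (no numerical underflow)
    probs = [w / evidence for w in weights]
    expected_y = sum(p * y for p, (_, y) in zip(probs, candidates))
    return evidence, probs, expected_y
```

For example, with a component centered at 0 and an observed value of 3, almost all the posterior mass goes to the hypothesis that the sample was biased, so the expected y is close to 0 rather than 3.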

Table 6 and Fig. 18 confirm that for small amounts of deterioration in relation to the sample
size, the estimates computed by the Basic EM algorithm over the "raw" observed data are similar
to those obtained by the Extended EM algorithm (e.g., Cases 3 and 4). However, when the data sets
are moderately deteriorated the true joint density can be correctly recovered by Extended EM using
the likelihood functions of the attributes instead of the raw observed data (e.g., Cases 5 and 6, Fig.
18.c). Finally, when there is a very large amount of uncertainty with respect to the training sample
size the true joint density cannot be adequately recovered (e.g., Cases 7 and 8, Fig. 18.d).

Figure 18. Illustration of the advantages of the Extended EM algorithm (see text). (a)
Case 0 (exact data), (b) Cases 1 and 2 (biased data), (c) Cases 5 and 6 (moderate noise),
(d) Cases 7 and 8 (large noise). All figures show the true mixture components (gray
ellipses), the available raw observations (black and white squares), the components
estimated by Basic EM from the raw observations (dotted ellipses) and the components
estimated by Extended EM taking into account the likelihood functions of the uncertain
values (black ellipses).



Note that the ability to learn from uncertain information suggests a method to manage
nonrandom missing attributes (e.g., censoring) (Ghahramani & Jordan 1994) and other complex
mechanisms of uncertain data generation. As illustrated in Case 9, if the missing data generation
mechanism depends on the value of the hidden attribute, it is not correct to assign equal likelihood
to all components. In principle, statistical studies or other kinds of knowledge may help to ascertain
the likelihood of the true values as a function of the available observations. For instance, in Case 10
we replaced the missing attributes of Case 9 by normal likelihoods y = 4±2 (i.e., "y is high"),
improving the estimates of the mixture parameters.

                      Component 1                          Component 2
Case    P      μx     μy     σx     σy   P(white)    μx     μy     σx     σy   P(black)
True   .50    0      0      2      1      1         2      2      1      2      1
0      .48   -.04    .03   2.11   1.00   1.00      2.09   1.83   1.00   2.08   1.00
20%    .48    .16   -.03   1.91   1.01   0.96      2.31   2.09    .88   2.10   1.00
40%    .49    .14   -.16   1.78    .99   0.95      2.39   2.29    .94   2.14   1.00
60%    .45    .02   -.29   2.50    .78   1.00      1.86   2.01   1.03   1.77   1.00
80%    .49   -.11   1.69   2.21   1.73    .50      1.91   0.31   1.14   0.68   0.47
1      .48    .06    .90   1.88   1.73   1.00      1.88   2.98    .96   2.40   1.00
2      .48   -.04    .05   1.90    .92   1.00      1.98   1.97   1.01   1.90   1.00
3      .47   [remaining entries illegible in source]
4      .49    .27   -.08   1.97    .90    .70      2.10   2.06   1.01   1.94   0.71
5      .43   -.04   -.18   2.40   1.48    .82      1.85   1.52   1.52   2.33    .80
6      .54   -.07   -.02   1.97   1.09    .56      1.93   2.11    .85   1.69    .62
7      .46    .15   -.16   2.73   2.53    .31      1.94   1.62   2.47   2.90    .29
8      .79    .87    .08   2.09   1.52    .51      1.96   3.61   0.87   1.21    .54
9      .48    .32   -.02   1.77   1.10   1.00      1.92   0.67   0.94   1.22   1.00
10     .45    .00    .03   2.20   1.01   1.00      2.13   1.55   1.04   1.77   1.00

Table 6. Parameter Estimates from Uncertain Information (see text)



Example 14: Learning from examples with missing attributes has been performed over the IRIS
domain to illustrate the behavior of the MFGN framework. The whole data set was randomly
divided into two subsets of equal size for training and testing. 5-component mixture models
were obtained and evaluated, combining missing data proportions of 0% and 50%. The
prediction error on attribute U (plant class) was the following:

      missing attributes
training set     test set     prediction error
     0%             0%              2.7%
     0%            50%             12.0%
    50%             0%              4.0%
    50%            50%             18.7%

In the relatively simple IRIS domain, the performance degradation due to 50% missing
attributes is much greater in the inference stage than in the learning stage. The Extended EM
algorithm is able to correctly recover the overall structure of the domain from the available
information.

4.4 Comments 

Convergence of the EM algorithm is very fast, requiring no adjustable parameters such as learning
rates. The algorithm is robust with respect to the random initial mixture parameters: bad local
maxima are not frequent and alternative solutions are usually equally acceptable. All the examples
contribute to all the components, which are never wasted by unfortunate initialization. For a fixed
number of components, the algorithm progressively increases the likelihood J of the training data
until a maximum is reached. When the number of components is incremented the maximum J also
increases, until a limit value is obtained that cannot be improved using extra components
(Fukunaga 1990). Some simple heuristics can be incorporated into the standard Expectation-Maximization
scheme to control the value of certain parameters (e.g., lower bounds can be established for
variances) or the quality of the model (e.g., mixture components can be eliminated if their proportions
are too small).
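
The two heuristics mentioned above (a lower bound on the deviations and the elimination of components with very small proportions) could be applied after each M step roughly as follows. This is a minimal sketch under assumptions: the dictionary layout, the function name `apply_heuristics` and the default thresholds are hypothetical and are not taken from the paper.

```python
def apply_heuristics(components, min_sigma=1e-3, min_proportion=0.01):
    """Floor the deviations and prune negligible mixture components.

    components: list of dicts with keys "proportion", "mus" and "sigmas"
    (one mean and one deviation per continuous attribute).
    Returns a new list whose proportions are renormalized to sum to 1.
    """
    kept = [c for c in components if c["proportion"] >= min_proportion]
    if not kept:
        raise ValueError("all components fell below min_proportion")
    total = sum(c["proportion"] for c in kept)
    return [
        {
            "proportion": c["proportion"] / total,
            "mus": list(c["mus"]),
            # The lower bound avoids degenerate zero-width components.
            "sigmas": [max(s, min_sigma) for s in c["sigmas"]],
        }
        for c in kept
    ]
```

Both thresholds are free parameters of this sketch and would have to be chosen in relation to the scale of the attributes and the training sample size.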

In our case, factorized components are especially convenient because matrix inversions are not
required and, what is more important, uncertain and missing values can be correctly handled in a
simple and unified way, for heterogeneous attribute sets. It is not necessary to provide models for
uncertain attribute correlations since no covariance parameters must be estimated. Finally, the
training sample size must be large enough in relation to both the degree of uncertainty of the
examples and the complexity of the joint density model in order to obtain satisfactory approximations.

On the other hand, the number of mixture components required for a satisfactory approxima- 
tion to the joint density must be specified. A pragmatic option is the minimization of the experi- 
mental estimation cost over the main inference task, if it exists. For instance, in regression we 
could increase the number of components until an acceptable estimation error is obtained over an 
independent data set (cross-validation). The same idea applies to pattern classification: use the 
number of components that minimizes the error rate over an independent test set. However, one of 
the main advantages of the proposed method is the independence between the learning stage and 
the inference stage, where we can freely choose and dynamically modify the input and output role 
of the attributes. Therefore, a global validity criterion is desirable. Some typical validation methods 
for mixture models are reviewed in McLachlan & Basford (1988); the standard approach is based 
on likelihood ratio tests on the number of components. Unfortunately, this method does not validate 
the mixture itself, only selects the best number of components (DeSoete 1993). 

Since the MFGN framework provides an explicit expression for the model p(z), we can apply 
statistical tests of hypothesis over an independent sample T taken from the true density (e.g. a sub- 
set of the examples reserved for testing) to find out if the obtained approximation is compatible 
with test data. If the hypothesis H = {T comes from p(z)} is rejected, then the learning process must
continue, possibly increasing the number of components. It is not difficult to build some statistical 
tests, e.g. over moments of p(z), because their sample means and variances can be directly obtained. 
However, as data sets usually include symbolic and numeric variables, we have also developed a 
test on the expected likelihood of the test sample, which measures how well p(z) "covers" the ex- 
amples. The mean and variance of p(z) can be easily obtained using the properties of generalized 
normals. Some experiments over simple univariate continuous densities show that this test is not 
very powerful for small sample sizes, i.e. incompatibility is not always detected, while other stan- 
dard tests significantly evidence rejection. Nevertheless, clearly inaccurate approximations are 
detected, results improve as the sample size increases and the test is valid for data sets with uncer- 
tain values. 
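
As an illustration of the quantities involved in such a test, the sketch below computes the expected value of p(z) under p itself for a univariate Gaussian mixture and the corresponding average over a test sample. It relies on the standard identity that the integral of the product of two Gaussian densities is a Gaussian density evaluated at the difference of their means. This is not the paper's test statistic: the function names are hypothetical and only continuous, univariate, exactly observed data are assumed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(z, mixture):
    """mixture: list of (proportion, mu, sigma) triples."""
    return sum(p * normal_pdf(z, mu, sigma) for p, mu, sigma in mixture)

def expected_density(mixture):
    """E[p(z)] for z drawn from p, i.e. the integral of p(z)**2."""
    return sum(
        pi * pj * normal_pdf(mi, mj, math.hypot(si, sj))
        for pi, mi, si in mixture
        for pj, mj, sj in mixture
    )

def average_density(sample, mixture):
    """Empirical counterpart of expected_density over a test sample."""
    return sum(mixture_pdf(z, mixture) for z in sample) / len(sample)
```

If the test sample really comes from p(z), `average_density` should be close to `expected_density`; a formal test would additionally need the variance of p(z), which can be derived in the same way from triple products of Gaussians.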

The Minimum Description Length principle (Li & Vitanyi 1993) can also be invoked to select
the optimum number of components by trading off the complexity of the model and the accuracy in
the description of the data.

5. Discussion and Experimental Results
5.1 Advantages of Joint Models 

Most inductive inference methods compute a direct approximation of the conditional densities of
interest, or even obtain empirical decision rules without explicit models of the underlying
conditional densities. In these cases, both the model and the learning stage depend on the selected
input/output role of the variables. In contrast, we have presented an inference and learning method based
on an approximation to the joint probability density function of the attributes by a convenient
parametric family (a special mixture model). The MFGN framework works as a pattern completion 
machine operating over possibly uncertain information. For example, given a pattern classification 
problem, the same learning stage suffices for predicting class labels from feature vectors and for 
estimating the value of missing features from the observed information in incomplete patterns. The 
joint density approach finds the regions occupied by the training examples in the whole attribute 
space. The attribute dependences are captured at a higher abstraction level than the one provided by 
strictly empirical rules for pre-established target variables. This property is extremely useful in 
many situations, as shown in the following examples. 

Example 15: Hints can be provided for inference over multivalued relations. Given the data
set and model from Example 10, assume that we are interested in the value of x for y = 0. We
obtain the bimodal marginal density shown in Fig. 19.a and the corresponding estimator x =
0.2±1.4 which is, in some sense, meaningless. However, if we specify the branch of interest
of the model, inferring from y = 0 AND x = -1±1 (i.e., "x is small"), we obtain the unimodal
marginal density in Fig. 19.b and the reasonable estimator x = -0.8±0.5.




Figure 19. The desired branch in multivalued relations can be selected by providing
some information about the output values. (a) Bimodal posterior density inferred from
y = 0. (b) Unimodal posterior density inferred from y = 0 and the hint "x is small".
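
The effect of such a hint can be reproduced on a toy factorized mixture. The sketch below is not the model of Example 10: the two-component mixture used in the usage note is invented for illustration, the function name `posterior_x` is hypothetical, and the hint x = -1±1 is interpreted as a Gaussian likelihood with deviation 0.5, following the ±2 deviations convention used earlier for confidence intervals.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_x(components, y, hint=None):
    """Mean and deviation of x given an exact y and an optional hint on x.

    components: list of (P, mu_x, sigma_x, mu_y, sigma_y) tuples.
    hint: None, or (s, eps) meaning x = s +/- 2*eps.
    """
    weights, means, variances = [], [], []
    for p, mx, sx, my, sy in components:
        w = p * normal_pdf(y, my, sy)
        if hint is None:
            mean, var = mx, sx ** 2
        else:
            s, eps = hint
            # The hint reweighs the component and sharpens x inside it.
            w *= normal_pdf(s, mx, math.hypot(sx, eps))
            gamma = sx ** 2 / (sx ** 2 + eps ** 2)
            mean = gamma * s + (1.0 - gamma) * mx
            var = gamma * eps ** 2
        weights.append(w)
        means.append(mean)
        variances.append(var)
    total = sum(weights)
    alphas = [w / total for w in weights]
    mean = sum(a * m for a, m in zip(alphas, means))
    second = sum(a * (v + m * m) for a, m, v in zip(alphas, means, variances))
    return mean, math.sqrt(second - mean ** 2)
```

With two hypothetical branches such as `[(0.5, -1.0, 0.3, 0.0, 0.3), (0.5, 1.0, 0.3, 0.0, 0.3)]`, calling `posterior_x(branches, 0.0)` gives a mean of 0 with a large deviation (the bimodal case), whereas `posterior_x(branches, 0.0, hint=(-1.0, 0.5))` concentrates the estimate on the branch near x = -1.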



Example 16: Image Processing. The advantages of a joint model supporting inferences from
partial information on both inputs and outputs can be illustrated in the following application
with natural data (see Fig. 20). The image in Fig. 20.a is characterized by a 5-attribute density
(x, y, R, G, B) describing position and color of the pixels. A random sample of 5000 pixels was
used to build a 100-component mixture model. We are interested in the location of certain
objects in the image. Figs. 20.b-f show the posterior density of the desired attributes given the
following queries:

■ "Something light green". Fig. 20.b. Two groups can be easily identified in the posterior
density, corresponding to the upper faces of the green objects*. S = C1 = {x, y unknown;
R=110±50, G=245±10, B=160±50}.

■ "Something light green OR dark red". Fig. 20.c. We find the same groups as above and an
additional, more scattered group, corresponding to the red object. This greater dispersion
arises from the larger size of the red object and also from the fact that the R component of
dark red is more disperse than the G component of light green. S = Two equiprobable
components with C1 as above and C2 = {x, y unknown; R=110±10, G=B=30±50}.

■ "Something light green on the right". Fig. 20.d. Here we provide partial information on the
output: S = C3 = {x=240±30; y unknown; R=110±50, G=245±10, B=160±50}.

■ "Something white". Fig. 20.e. S = C4 = {x, y unknown; R=245±10, G=245±10, B=245±10}.

■ "Something white, in the lower-left region, under the main diagonal (y<240-x)". Fig. 20.f.
Here we provide relational information on the attributes that can be modeled by S = 6
equiprobable components (note that in this case the posterior distribution contains 600
components, but it is still computationally manageable) =

{x=60±30, y=180±30, R=245±10, G=245±10, B=245±10} +
{x=60±30, y=120±30, R=245±10, G=245±10, B=245±10} +
{x=60±30, y=60±30, R=245±10, G=245±10, B=245±10} +
{x=120±30, y=120±30, R=245±10, G=245±10, B=245±10} +
{x=120±30, y=60±30, R=245±10, G=245±10, B=245±10} +
{x=180±30, y=60±30, R=245±10, G=245±10, B=245±10}

In all cases, the posterior density is consistent with the structure of the original image. The time
required to compute the posterior distribution is always lower than one second. Learning time
was of the order of hours on a Pentium 100 system. Simpler models (25 components, obtained from
1000 random pixels) also produced acceptable results with much lower learning time.
Furthermore, the EM algorithm can be efficiently parallelized.
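
The six-component query used for the relational constraint can be generated mechanically by tiling the region under the diagonal with square cells. The sketch below reproduces the six cell centers listed in the example; the function name, the dictionary layout and the decision to keep cells whose center lies exactly on the diagonal are assumptions made for illustration.

```python
def white_lower_left_query(step=60, limit=240, half_width=30):
    """Approximate "white AND y < limit - x" by equiprobable box components.

    Each component is a dict mapping an attribute name to a pair
    (center, half_width), read as center +/- half_width.
    Returns a list of (weight, component) pairs.
    """
    centers = range(step, limit, step)  # 60, 120, 180 with the defaults
    white = {"R": (245, 10), "G": (245, 10), "B": (245, 10)}
    cells = [
        dict(x=(cx, half_width), y=(cy, half_width), **white)
        for cx in centers
        for cy in centers
        if cx + cy <= limit  # cells centered on the diagonal are kept
    ]
    weight = 1.0 / len(cells)
    return [(weight, cell) for cell in cells]
```

Since every query component is combined with every component of the learned model, a 100-component mixture and this 6-component query lead to the 600 posterior components mentioned in the example.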

On the other hand, when there is a large number of irrelevant attributes, the joint model
strategy wastes resources to capture a proper probability density function along unnecessary
dimensions. (This problem does not arise in the specification of a likelihood function, since only the
relevant attributes explicitly appear in the model.) Joint modeling is appropriate for domains with a
moderate number of "meaningful" variables without fixed input/output roles.



* Note that a sharp peak (a component with small dispersion) was obtained in the learning process, which is also
"transmitted" to the posterior density. Artifacts of this kind are innocuous and can be easily removed by post-processing.

Figure 20. Inference results for the image domain in Example 16. (a) source image,
(b) posterior density of attributes x-y given "Something light green", (c) the same for
"Something light green OR dark red", (d) for "Something light green on the right",
(e) for "Something white", (f) for "Something white, in the lower-left region, under the
main diagonal of the image (Y<240-X)".



5.2 Advantages of Factorization 

The proposed methodology is supported by the general density approximation property of mixture
models. We use components with independent variables in order to make computations feasible in
the inference and learning stages. Factorized components can be imposed on a mixture model
without loss of generality. Any statistical dependence between variables can still be captured, at the cost
of a possibly larger number of components in the mixture to achieve the required accuracy in the
approximation.

The simplicity of the "building block" structure is entirely compensated by an important sav- 
ing in computation time. High-dimensional integrals are analytically computed from univariate 
integrals and matrix inversions are avoided in the learning stage. Additionally, high-dimensional 
domains can be easily modeled using a small number of parameters in each mixture component. 
From the viewpoint of Computational Learning Theory (Vapnik 1995), models with a small num- 
ber of adjustable parameters (actually, with low "expressive power") have favorable consequences 
for generalization. 

Mixtures of factorized components are also used in Latent Class Analysis (DeSoete 1993), a 
well-known unsupervised classification technique. It is assumed that the statistical dependences 
between attributes can be fully explained by a hidden variable specifying the "latent class" of each 
example. This method is similar to the Gaussian decomposition clustering algorithm mentioned in 
Section 1, constrained to component-conditional attribute independence. However, our goal is not 
unsupervised classification but obtaining an accurate and mathematically convenient expression for 
the joint density of the variables, required to derive the desired estimators. The meaning of the 
components is irrelevant, as long as the whole mixture is a good approximation to the joint density. 

More expressive architectures, which combine mixture models with local dimensionality
reduction, have also been considered: Mixtures of Linear Experts (Jordan & Jacobs 1994), Mixtures
of Principal Component Analyzers (Sung & Poggio 1998) or Mixtures of Factor Analyzers
(Ghahramani & Hinton 1996, Hinton, Dayan, & Revow 1997). Unfortunately, the general kind of
inference and learning from uncertain data considered in this work cannot be directly incorporated into
these architectures with the computational advantages demonstrated by the MFGN model.

The restriction to factorized components may produce undesirable artifacts in the
approximations of certain domains learned from small training samples. Nevertheless, this problem
affects any approximator when the structure of the building block does not match the "shape" of
the target function. In this case, many terms (or components, units, etc.) are required for a good 
approximation and the associated parameters can be correctly adjusted only from a large training 
sample. However, note that the complexity of the model should not be measured uniquely in terms 
of the number of mixture components. The number of adjustable parameters is probably a better 
measure of complexity. For instance, full covariance models show a quadratic growth of the num- 
ber of free parameters with respect to the dimension of the attribute vector. For factorized compo- 
nents the growth is linear, so the amount of training data need not be unreasonably high even if the 
number of mixture components is large. 
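
The difference in growth can be checked with a short parameter count. The sketch below assumes continuous attributes only and counts means, (co)variance terms and mixing proportions; the function name `free_parameters` is hypothetical.

```python
def free_parameters(n_components, dim, full_covariance):
    """Number of adjustable parameters of a Gaussian mixture.

    Each component has dim means plus either dim variances (factorized)
    or dim*(dim+1)/2 covariance terms (full covariance). The mixing
    proportions add n_components - 1 parameters because they sum to 1.
    """
    if full_covariance:
        per_component = dim + dim * (dim + 1) // 2
    else:
        per_component = 2 * dim
    return n_components * per_component + (n_components - 1)
```

For 32 continuous attributes, a single full covariance component already has 560 parameters, while a mixture of 8 factorized components has 519.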

In real applications, the nature of the target function is unknown, so little can be said a priori 
about the best building block structure to be used by a universal approximator. We have chosen a 
very simple component structure to make inference and learning feasible from uncertain informa- 
tion. Section 5.4 provides experimental evidence that in realistic problems the proposed model is 
not inferior to other popular approaches. 

5.3 Qualitative Comparison with Alternative Approaches 

Instead of the proposed methodology, based on mixture models and the EM algorithm, other
nonparametric density approximation methods could also be used (either for the joint density
or for specific conditional densities). For instance, the nearest neighbor rule locally approximates
the target density using a certain number of training samples near the point of interest. Symbolic
attributes are directly estimated by a voting scheme and continuous attributes can also be estimated
by averaging the observed values of training instances which are near, in the subspace of observed
attributes, to the point of interest. However, for small sample sizes, the above estimators are not
smooth and show strong sensitivity to random fluctuations in the training set, which penalizes the
estimation cost. For large sample sizes, the time required to find the nearest neighbors becomes very
long. As an example, consider the regression problem in Example 10, Section 3.5. Fig. 11.b shows
the MFGN solution with 4 components and MSE=0.381. Fig. 21.a shows the regression line
obtained by 5-nearest-neighbors average, with a higher MSE=0.522.

Parzen windows and similar kernel approximation methods are used to smooth the results of 
the simple nearest neighbors rule (Duda & Hart 1973, Izenman 1991). They are actually mixtures 
of simple conventional densities located at the training samples. In principle, the properties of the 
MFGN framework could be adapted to that kind of approximation (Ruiz et al. 1998). Learning 
becomes trivial, but strong run time computation effort is required since a "concise" model of the 
domain is not extracted from the training set. This kind of rote learning also has negative
consequences on generalization according to Occam's Razor principle (Li & Vitanyi 1993). An
adequately cross-validated mixture model with a small number of components in relation to the
training sample size gives reasonable assurance that the true attribute dependencies are correctly
captured.




Figure 21. Alternative solutions in regression and classification (see text for details). 



The nature of the solutions obtained by Backpropagation Multilayer Perceptrons (Rumelhart et 
al. 1986) in pattern classification is also illustrative. In general, each decision region can be geo- 
metrically expressed as the union of intersections of several half-spaces defined by the units in the 
first hidden layer. However, backprop networks often require very long learning times, many ad- 
justable parameters and, what is worse, apparently simple distributions of patterns are hard to learn. 
For instance, the solution to the circle-ring classification problem in Fig. 21.b, obtained by a
network with 6 hidden units, requires hundreds of standard backprop epochs. The decision regions are
not very satisfactory, even though the network has extra flexibility for this task (3 hidden units
suffice to separate the training examples). Better solutions exist using all the resources in the
network architecture, but backprop learning does not find them. In contrast, the solution obtained by
the MFGN approach using 7 components (Fig. 21.c) requires a learning time orders of magnitude
shorter than backprop optimization. All the components in the mixture contribute to synthesize
reliable decision regions and acceptable solutions can also be obtained with a smaller number of
components.

The proposed approach is closely related to a well-known family of approximation techniques
which, essentially, distribute (using some kind of clustering or self-organizing algorithm) "detec- 
tors" over the relevant regions of the input space and then combine their responses for computing 
the desired outputs. This is the case of Radial Basis Functions (RBF) (Hertz et al.), the classifica- 
tion and regression trees proposed in (Breiman et al. 1984) and the topological maps used in (Cher- 
kassky & Najafi 1992) to locate the "knots" required for piecewise linear regression. 

A relevant methodology is proposed in (Jordan & Jacobs 1994, Peng et al. 1995), where the 
EM algorithm is used to learn hierarchical mixtures of experts in the form of linear rules in such a 
way that the desired posterior densities can be explicitly obtained. The properties of the EM
algorithm are also satisfactorily used in (Ghahramani & Jordan 1994) to obtain unbiased
approximations from missing data in a mixture-based framework similar to ours. Our framework extends this
successful approach by exploiting the conjugate properties of the chosen universal approximation 
model: uncertain information of arbitrary complexity can be efficiently processed in the inference 
and learning stages. 

The MFGN framework is appropriate for a moderate number of variables showing relatively
complex dependencies. In contrast, Bayesian Networks satisfactorily address the case of a large
number of variables with clear conditional independence relations. There are situations in which a 
certain subset of the variables in a Bayesian Network shows no explicit causal structure. This sub- 
domain could be empirically modeled by a mixture model in order to be considered later as a com- 
posite node embedded in the whole network. If the subdomain can be conditionally isolated from 
the rest of variables through a set of communication nodes, the MFGN framework can be used to 
perform the required inferences. 

Finally, mixture models are typically used for unsupervised classification: the examples are 
labeled with the index of the component with highest posterior probability. In fact, the MFGN 
framework explicitly finds clusters in the training set. Furthermore, continuous and symbolic at- 
tributes are allowed in the joint density, so the examples are clustered using an implicit probabilis- 
tic metric which automatically weighs all the (heterogeneous) attributes, even with missing and 
uncertain values. However, this method is effective only when the groups of interest have the same 
structure as the component densities. In order to simplify inference the mixture components have 
been selected with constraints (Gaussian, independent variables) which are not necessarily verified 
by the "natural" groups found in real applications. 
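
The labeling rule described above (assign each example to the component with highest posterior probability, skipping missing attributes) can be sketched as follows. The sketch assumes continuous attributes only, represents a missing value by `None`, and uses hypothetical function names; symbolic attributes and general likelihood functions are not covered.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Logarithm of the univariate Gaussian density."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def most_probable_component(example, components):
    """Index of the component with highest posterior probability.

    example: list of attribute values, with None for missing attributes.
    components: list of (proportion, mus, sigmas) with factorized Gaussians.
    Missing attributes are simply left out of the product of densities.
    """
    best_index, best_score = None, -math.inf
    for index, (p, mus, sigmas) in enumerate(components):
        score = math.log(p) + sum(
            log_normal_pdf(x, mu, sigma)
            for x, mu, sigma in zip(example, mus, sigmas)
            if x is not None
        )
        if score > best_score:
            best_index, best_score = index, score
    return best_index
```

When all attributes are missing, the rule falls back to the component with the largest proportion.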

A tentative possibility (inspired by a common heuristic clustering technique) consists of
joining overlapping components (e.g., according to the Bhattacharyya distance, a well-known bound on
the Bayes error used in Statistical Pattern Recognition (Fukunaga 1990)). Unfortunately, our
experiments indicate that the overlapping threshold is a free parameter that strongly determines the
quality of the results. A universal threshold, independent of the application, does not seem to exist. 
In principle, clusters of arbitrary geometry may be discovered, but this cannot be easily automated. 
Therefore, other nonparametric cluster analysis methods (e.g. density valley seeking) are suggested 
for labeling complex groups. 
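
For factorized Gaussian components the Bhattacharyya distance reduces to a sum of univariate terms, which makes the overlap criterion cheap to evaluate. The sketch below uses the standard closed form for Gaussians with diagonal covariance; the function names are hypothetical, and any merging threshold applied to the result remains the application-dependent free parameter discussed above.

```python
import math

def bhattacharyya_distance(mus1, sigmas1, mus2, sigmas2):
    """Bhattacharyya distance between two factorized Gaussian components.

    Each component is given by its per-attribute means and deviations.
    """
    distance = 0.0
    for m1, s1, m2, s2 in zip(mus1, sigmas1, mus2, sigmas2):
        v = s1 ** 2 + s2 ** 2
        distance += (m1 - m2) ** 2 / (4.0 * v)
        distance += 0.5 * math.log(v / (2.0 * s1 * s2))
    return distance

def bayes_error_bound(p1, p2, distance):
    """Upper bound on the Bayes error of two classes with priors p1, p2."""
    return math.sqrt(p1 * p2) * math.exp(-distance)
```

Identical components give a distance of 0 and, for equal priors, the trivial bound 0.5; well separated components give a large distance and a bound close to 0.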

5.4 Experimental Evaluation 

The MFGN method has been evaluated on standard benchmarks from the Machine Learning data- 
base repository at the University of California, Irvine (Merz and Murphy 1996). It contains induc- 
tive learning problems which are representative of real world situations. We have experimented 
with the following databases: Ionosphere, Pima Indians, Monk's Problems, and Horse Colic, which 
illustrate different properties of the proposed methodology. In most cases MFGN has been
compared to alternative learning methods with respect to the inference task considered of interest in
each problem (typically, prediction of a specific attribute given the rest of them). We usually give
the error rate over both the training and the test set to indicate the amount of overfitting obtained by
the learning algorithms.




(a) Ionosphere (b) Pima Indians 

Figure 22. Most discriminant 2D projections of two representative databases. 



5.4.1 Ionosphere Database 

Two classes of radar returns from the ionosphere must be discriminated from vectors of 32
continuous attributes^. There are 351 examples, randomly partitioned into two disjoint subsets of
approximately equal size for training and testing. The prevalence of the minority class (random
prediction rate) is 36%. Figure 22.a and Table 7 show that this is a typical statistical pattern
recognition problem, easily solvable by standard methods. The results suggest that the Bayes (optimum)
error probability is around 5%.


METHOD                                                         error rate        Pe
                                                               (training set)    (test set)
Linear MSE (pseudoinverse)                                     .11               .14
1-1 Nearest Neighbor                                           -                 .13
2-3 Nearest Neighbor                                           -                 .18
Parzen Model                                                   .05               .08
Backprop Multilayer Perceptron 2 hidden units                  .00               .08
Support Vector Machine, RBF kernel, width 1, (105 s.v.)        -                 .05
Support Vector Machine, RBF kernel, width 3, (35 s.v.)         -                 .09
Support Vector Machine, polynomial kernel, order 2, (41 s.v.)  -                 .13
Support Vector Machine, polynomial kernel, order 3, (45 s.v.)  -                 .17
Support Vector Machine, polynomial kernel, order 4, (42 s.v.)  -                 .20
Full covariance gaussian mixture, 1 component/class            .03               .11
Full covariance gaussian mixture, 2 component/class            .01               .19
Full covariance gaussian mixture, 3 component/class            .005              .26
MFGN 4 components (average)                                    .22±.15           .21±.08
MFGN 8 components (average)                                    .11±.06           .13±.06
MFGN 15 components (average)                                   .10±.05           .13±.06
MFGN, best result by cross-validation (8 components)           .07               .06

Table 7. Ionosphere Database Results



^ Originally the database contains 34 attributes. Two of them, meaningless or ill behaved, were eliminated.

In this problem, the plain MFGN method, without special heuristics in the learning stage, is
comparable on average to the alternative methods. The best solution on the training set
(cross-validation) is entirely satisfactory.

For the Ionosphere database we also present an exhaustive study of performance given varying 
proportions of missing values in the training and testing examples. A value of x % means that in all 
training or test examples the value of each attribute is deleted with probability x. The basic
experiment consists of learning an MFGN model with the prescribed number of components (4, 8 and 15)
and computing the error rate on the training and test sets. Table 8 shows the mean value ± 2
standard deviations of the error rates obtained in 10 repetitions of the basic experiment in each
configuration. Column M contains the error rate of each configuration over its own training set. The
training/test partition is kept fixed to analyze the variability of the solutions due to random initialization
of the EM.
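
The deletion mechanism used in this experiment (each attribute value removed independently with a fixed probability) can be sketched as follows. The function name `delete_at_random` and the use of `None` as the missing-value marker are assumptions made for illustration.

```python
import random

def delete_at_random(examples, rate, rng=None):
    """Return a copy of the data where each value is deleted with probability rate.

    examples: list of rows (lists of attribute values).
    rate: deletion probability in [0, 1], e.g. 0.25 for 25% missing values.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    if rng is None:
        rng = random.Random()
    return [
        [None if rng.random() < rate else value for value in row]
        for row in examples
    ]
```

Applying the function separately to the training and test subsets, with different rates, yields the learning and inference configurations combined in the experiment.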


                                  INFERENCE
LEARNING            M         0%        10%       25%       50%
4 COMP. - 0%        22±15     21±8      21±8      22±8      22±9
8 COMP. - 0%        11±6      13±6      12±6      13±5      12±6
15 COMP. - 0%       10±5      13±6      13±6      13±5      13±4
4 COMP. - 10%       21±14     23±11     23±11     23±11     23±11
8 COMP. - 10%       11±3      13±5      12±5      13±5      13±4
15 COMP. - 10%      10±3      12±6      12±7      12±6      12±3
4 COMP. - 25%       18±7      19±5      19±5      18±6      18±6
8 COMP. - 25%       12±7      14±10     14±9      15±8      14±7
15 COMP. - 25%      9±5       12±9      13±11     13±9      13±7
4 COMP. - 50%       27±18     26±15     27±15     27±14     26±13
8 COMP. - 50%       16±12     21±15     21±15     21±13     20±11
15 COMP. - 50%      13±6      26±17     25±15     25±14     23±13

Table 8. Evaluation of MFGN on Ionosphere Database given
different proportions of missing data in the training and testing subsets.



As expected, the MFGN model is robust with respect to large proportions of missing values in
the test patterns, and to moderate proportions of missing data in the training set. We have
compared the above behavior with a standard algorithm for decision tree construction inspired by
(Quinlan 1993), which is also able to support missing values^. Table 9 shows the error rates of the
decision trees for the same experimental setting as in Table 8. This kind of decision tree obtains
error rates that are better than the averages obtained by MFGN. However, MFGN's best solutions
(selected by cross-validation) are better than the ones obtained by the decision tree. Furthermore,
decision tree performance degrades faster than that of MFGN, especially with respect to the
proportion of missing values in the inference stage.
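The decision trees used in this comparison handle a missing value at inference time by following every branch with a weight and combining the outputs. A minimal sketch of that idea is given below. The dictionary layout, the field names and the use of the training fraction `p_left` as the branch weight are assumptions made for illustration; the text only speaks of "appropriate weights".

```python
import math

def predict_proba(node, x):
    """Class probabilities for x; a missing attribute sends x down both branches."""
    if "probs" in node:                      # leaf: stored class distribution
        return list(node["probs"])
    v = x[node["attr"]]
    if v is None or (isinstance(v, float) and math.isnan(v)):
        w = node["p_left"]                   # assumed weight: fraction of training cases sent left
        left = predict_proba(node["left"], x)
        right = predict_proba(node["right"], x)
        return [w * a + (1.0 - w) * b for a, b in zip(left, right)]
    child = node["left"] if v <= node["threshold"] else node["right"]
    return predict_proba(child, x)

# Hypothetical one-split tree over a single attribute.
tree = {
    "attr": 0, "threshold": 0.5, "p_left": 0.25,
    "left": {"probs": [1.0, 0.0]},
    "right": {"probs": [0.0, 1.0]},
}
print(predict_proba(tree, [0.2]))            # attribute observed: single path
print(predict_proba(tree, [float("nan")]))   # attribute missing: weighted mix of both leaves
```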



Essentially, missing values are handled as follows. In the learning stage, when an attribute is selected, examples with
missing values are sent to all the partitions with appropriate weights. In the inference stage, if a node asks for a missing
value, it follows all the branches with appropriate weights and finally the outputs are combined.

                        INFERENCE
LEARNING     M          0 %        10 %       25 %       50 %
0 %          1 %        9 %        10 %       11 %       12 %
10 %         5 %        14 %       15 %       19 %       18 %
25 %         6 %        15 %       17 %       17 %       18 %
50 %         8 %        17 %       18 %       18 %       19 %




Table 9. Evaluation of basic Decision Tree on Ionosphere Database given different
proportions of missing data in the training and testing subsets.



5.4.2 Pima Indians Database 

In this problem we must discriminate between two possible results of a diabetes test given to Pima
Indians. There are 8 continuous attributes and 768 examples, randomly partitioned into two
disjoint subsets of equal size for training and testing. The prevalence of the minority class is 35%. The
attribute vector has been normalized. Table 10 presents comparative results.





METHOD                                                           error rate        Pe
                                                                 (training set)    (test set)
Linear MSE (pseudoinverse)                                       .22               .23
Oblique Decision Tree 8 decision nodes                           .18               .24
1-1 Nearest Neighbor                                             -                 .30
2-3 Nearest Neighbor                                             -                 .25
Full covariance gaussian mixture, 1 component/class              .24               .26
Full covariance gaussian mixture, 2 component/class              .19               .29
Full covariance gaussian mixture, 3 component/class              .17               .30
Full covariance gaussian mixture, 4 component/class              .17               .31
Backprop Multilayer Perceptron 2 hidden units                    .17               .25
Backprop Multilayer Perceptron 4 hidden units                    .14               .24
Backprop Multilayer Perceptron 8 hidden units                    .05               .29
Support Vector Machine, RBF kernel, width 1 (297 s.v.)           -                 .30
Support Vector Machine, RBF kernel, width 3 (176 s.v.)           -                 .35
Support Vector Machine, polynomial kernel, order 4 (138 s.v.)    -                 .36
Support Vector Machine, polynomial kernel, order 5 (131 s.v.)    -                 .34
MFGN 4 components                                                .28               .35
MFGN 6 components                                                .25               .32
MFGN 8 components                                                .29               .35




Table 10. Pima Indians Database Results 



Despite the low dimensionality and the large number of examples, this classification problem is
hard (see Figure 22.b). Even sophisticated learners such as backpropagation networks, decision
trees or support vector machines, which are able to store a reasonable proportion of the training set,
do not achieve significant generalization. MFGN shows a similar behavior, although it is slightly
less prone to overfitting (the error rate on the training set is not misleading).

5.4.3 Horse Colic Database

This database contains a classification task from a heterogeneous attribute vector including
symbolic, discrete and continuous variables, with 30% missing values. It illustrates the problem of
feature selection in the context of joint modeling, mentioned in Section 5.1. Table 11 shows the
error rates obtained by MFGN using different attribute subsets^. To take advantage of its general
inference properties, the MFGN model must be applied to the attribute subset of interest. If the
inference task is fixed and the number of attributes is very large, alternative methods should be
used.
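The attribute subsets were obtained by scoring each feature individually with a discrimination index related to the Kolmogorov-Smirnov statistic (Ruiz 1995). The exact index is not reproduced here; as an assumed stand-in, the sketch below ranks features by the plain two-sample Kolmogorov-Smirnov statistic between the two classes, ignoring missing values (encoded as NaN). The function names are hypothetical, and each class is assumed to have at least one observed value per feature.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = np.sort(a[~np.isnan(a)])             # drop missing values, then sort
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])            # the gap can only change at sample points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def rank_features(X, y):
    """Column indices of X sorted by decreasing class separation, plus the scores."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    scores = np.array([ks_statistic(X[y == 0, j], X[y == 1, j])
                       for j in range(X.shape[1])])
    return np.argsort(-scores), scores

# Toy data: column 0 separates the classes, column 1 does not.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, np.nan],
              [10.0, 1.0], [11.0, 2.0], [12.0, np.nan]])
y = np.array([0, 0, 0, 1, 1, 1])
order, scores = rank_features(X, y)
print(order, scores)
```

A subset of a given size would then be taken as the first entries of `order`; because features are scored one at a time, redundancy between attributes is not taken into account.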



METHOD                    Pe                                   Pe
                          (distribution, 10 initializations)   (best)

6 Selected attributes
MFGN 2 components         .32±.00                              [illegible]
MFGN 3 components         .20±.05
MFGN 4 components         .19±.02                              [illegible]
MFGN 5 components         .20±.02
MFGN 6 components         .19±.03                              [illegible]
MFGN 7 components         .20±.04                              .10
MFGN 10 components        .22±.04                              .18
MFGN 12 components        .19±.03                              [illegible]
MFGN 15 components        .19±.04                              [illegible]

8 Selected attributes
MFGN 4 components         .22±.01                              [illegible]
MFGN 6 components         .21±.02                              .19
MFGN 8 components         .21±.03                              .18
MFGN 10 components        .23±.02                              .18
MFGN 12 components        .21±.02                              .18
MFGN 15 components        .21±.02                              .18

23 Selected attributes
MFGN 6 components         .28±.02                              .25
MFGN 8 components         .29±.03                              .25
MFGN 10 components        .34±.08                              .25
MFGN 15 components        .34±.06                              .28




Table 11. Horse Colic Database Results (random rate = .5) 



5.4.4 Monk's Problems

The Monk's problems are three concept learning tasks from 6 symbolic attributes, widely used as
benchmarks for inductive learning algorithms (Thrun et al. 1991). As seen in Table 12, MFGN fails
on MONK1 (where acceptable generalization is not obtained) and MONK2 (where the training
examples cannot even be stored). In contrast, MFGN correctly solves MONK3. This behavior is
related to the fact that the Monk's problems are based on deterministic or abstract concepts which
may lack the kind of geometric regularities in the attribute space required by probabilistic models^.



^ Features were individually selected using a simple discrimination index related to the Kolmogorov-Smirnov statistic
(Ruiz 1995).

A typical example is the parity problem: acceptable off-training-set generalization cannot be achieved if the inductive
bias of the learning machine favors "smooth" solutions.

Fig. 23 shows the most discriminant 2D projections of the datasets and illustrates the fact that
MONK2 cannot be easily captured by statistical techniques. In this benchmark, MFGN performance
is similar to that of other popular probabilistic methods (Thrun et al. 1991).
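The parity problem cited in the footnote makes the point concrete. The sketch below is an illustration with a 1-nearest-neighbor rule under Hamming distance, not with MFGN. It holds out each point of the full n-bit parity problem in turn and predicts its label from its nearest remaining neighbor. Every nearest neighbor differs in exactly one bit and therefore has the opposite parity, so a learner biased towards locally "smooth" labelings is wrong on every held-out point.

```python
from itertools import product

def parity(bits):
    """Label of a bit vector: 1 if it has an odd number of ones, else 0."""
    return sum(bits) % 2

def hamming(u, v):
    """Number of positions in which two bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def loo_1nn_error(n_bits):
    """Leave-one-out error rate of 1-nearest-neighbor on the full n-bit parity problem."""
    points = list(product((0, 1), repeat=n_bits))
    errors = 0
    for p in points:
        others = [q for q in points if q != p]
        nearest = min(others, key=lambda q: hamming(p, q))  # always at distance 1
        errors += parity(nearest) != parity(p)
    return errors / len(points)

print(loo_1nn_error(4))  # 1.0
```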



Figure 23. Most Discriminant 2D Projections of the Monk's Datasets: (a) MONK1, (b) MONK2, (c) MONK3.



METHOD                                                     error rate        Pe
                                                           (training set)    (test set)

MONK1 (random rate = .5)
Linear MSE (pseudoinverse)                                 .29               .34
1-1 Nearest Neighbor                                       -                 .17
Support Vector Machine, RBF kernel, width 1 (78 s.v.)      -                 .08
Cascade Correlation                                        -                 -
MFGN 4 components                                          .06               .40
MFGN 8 components                                          .00               .33

MONK2 (random rate = .4)
Linear MSE (pseudoinverse)                                 .40               .37
1-1 Nearest Neighbor                                       -                 .19
Support Vector Machine, RBF kernel, width 1 (117 s.v.)     -                 .20
Cascade Correlation                                        -                 -
MFGN 4 components                                          .31               .38
MFGN 8 components                                          .26               .44
MFGN 15 components                                         .14               .50

MONK3 (random rate = .5)
Linear MSE (pseudoinverse)                                 .19               .19
1-1 Nearest Neighbor                                       -                 .18
Support Vector Machine, RBF kernel, width 1 (69 s.v.)      -                 .08
Cascade Correlation                                        -                 .03
MFGN 2 components                                          .07               .03
MFGN 4 components                                          .04               .03
MFGN 8 components                                          .03               .08

Table 12. Monk's Problems Results

5.4.5 Comments

The above experiments demonstrate that the MFGN model is able to obtain acceptable results on
many real-world applications. In particular, the error rates obtained in standard classification tasks
are comparable to those obtained by other popular learners. Additionally, MFGN is able to perform
inferences over any other attribute given uncertain or partial information, which is not possible for
most of the alternative methods. This property makes MFGN a very attractive alternative for many
inference problems such as the one illustrated in Example 16. The experiments have also helped
to characterize the kind of problems for which the MFGN model is best suited. Essentially, the
relationship among attributes must be of a truly probabilistic nature, and the attribute vector must be
of moderate size and contain "relevant" variables. A prior feature selection / accommodation
stage is recommended in certain applications.

6. Conclusions 

We have developed an efficient methodology for probabilistic inference and learning from
uncertain information. Under the proposed MFGN framework, the joint probability density function of
the attributes and the likelihood function of the available information are approximated by
Mixtures of Factorized Generalized Normals. This mathematical structure allows efficient computation,
without numerical integration, of posterior densities and expectations of the desired variables given
events of arbitrary "geometry". An extended version of the EM learning algorithm has been
developed to estimate the parameters of the required mixture models from uncertain training examples.
Different paradigms such as pattern recognition, regression or pattern completion are subsumed under a
common framework.
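The absence of numerical integration is easiest to see in a simplified special case. The sketch below is not the MFGN implementation. It assumes plain Gaussian factors, exactly observed or entirely missing inputs (no uncertain observations), and hypothetical parameter names `priors`, `means` and `stds`. Because each component factorizes over attributes, a missing attribute is marginalized by dropping its factor, and the conditional expectation of a target attribute is a responsibility-weighted average of the component means.

```python
import numpy as np

def responsibilities(x, priors, means, stds):
    """Posterior component weights given the observed (non-NaN) entries of x."""
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(x)
    # Sum log N(x_i; mu, sigma) over observed attributes; missing factors integrate to 1.
    z = (x[obs] - means[:, obs]) / stds[:, obs]
    log_lik = (-0.5 * np.sum(z ** 2, axis=1)
               - np.sum(np.log(stds[:, obs] * np.sqrt(2.0 * np.pi)), axis=1))
    log_post = np.log(priors) + log_lik
    log_post -= log_post.max()               # stabilize before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

def conditional_mean(x, target, priors, means, stds):
    """E[x_target | observed entries of x] under the factorized mixture."""
    x = np.asarray(x, dtype=float).copy()
    x[target] = np.nan                       # the target is never used as evidence
    w = responsibilities(x, priors, means, stds)
    return float(w @ means[:, target])

# Hypothetical two-component model over two attributes.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
stds = np.ones((2, 2))
print(conditional_mean([0.0, np.nan], 1, priors, means, stds))     # evidence favors component 0
print(conditional_mean([np.nan, np.nan], 1, priors, means, stds))  # no evidence: prior-weighted mean
```

The same routine serves regression, pattern completion or classification (with a class attribute as the target), depending only on which entries are observed and which is requested.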

A comprehensive collection of examples illustrates the methodology, which has been critically
compared with alternative techniques. The Extended EM algorithm is able to learn satisfactory
domain models from a reasonable number of examples with uncertain values, taking into account
the explicit likelihood functions of the available information. Results are satisfactory whenever the
sample size is large in relation to the amount of (known) degradation of the training set. The
experiments also characterized the kind of situations that the model handles best: domains
described by a moderate number of heterogeneous attributes with complex probabilistic dependences,
problems in which the output variables are not necessarily known in the learning stage (i.e. pattern
completion), and, finally, problems in which an explicit management of uncertainty is needed,
either in the learning or in the inference stage (or even in both). The MFGN framework has obtained
a very favorable trade-off between useful features and model complexity in the solutions to
different applications and benchmarks.

Future developments of our work include improving the learning stage with heuristic
steps, combined with the standard E and M steps, to control the adequacy of the acquired
models. Additional studies are required on validation tests, generalization, scalability, robustness
and data preprocessing. The essential idea of working with explicit likelihood functions will be
incorporated into the Parzen approximation scheme, and we are also interested in more expressive
model structures such as mixtures of factor analyzers, principal component analyzers or linear
experts. Finally, the methodology can be developed in a pure Bayesian framework or subsumed under
the Dempster-Shafer Evidence Theory.